A  Big  RISC 

Richard A. Blomseth, Capt., USAF
Masters  Project  Final  Report 

Computer  Science  Division 

Department  of  Electrical  Engineering  and  Computer  Science 
University  of  California,  Berkeley 
Berkeley,  CA  94720 


ABSTRACT 

Big RISC (BRISC) is a high-speed CPU designed with 100K ECL logic and
based on the RISC I architecture. The design, performance, and cost of BRISC are
presented. Performance is shown to be better than high end mainframes such as
the IBM 3081 and Amdahl 470V/8 on integer benchmarks written in C, Pascal
and LISP. The cost, conservatively estimated to be $132,400, is about the same
as a high end minicomputer such as the VAX-11/780. BRISC has a CPU cycle
time of 46 ns, providing a RISC I instruction execution rate of greater than 15
MIPS.

BRISC  is  designed  with  a  Structured  Computer  Aided  Logic  Design  System 
(SCALD)  by  Valid  Logic  Systems.  An  evaluation  of  the  utility  of  SCALD  for 
computer  design  is  also  included. 


July  18,  1983 


The work reported herein was supported in part by Defense Advanced Research Projects Agency (DoD) ARPA
Order No. 3803, monitored by Naval Electronic System Command under Contract No. N00039-81-K-0251, and
the U.S. Department of Energy under Contract DE-AT03-76SF00034, Project Agreement DE-AS03-79ER10358.


12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 


Richard  A.  Blomseth 

Author 


A  Big  RISC 


Title 


RESEARCH PROJECT


Submitted to the Department of Electrical Engineering and
Computer Sciences, University of California, Berkeley,
in partial satisfaction of the requirements for the degree
of Master of Sciences, Plan II.


Approval for the Report and Comprehensive Examination:

Committee: ____________, Research Adviser    ____________, Date

           ____________, Second Reader       ____________, Date


Table  of  Contents 


1. Introduction
1.1. RISC I Architecture
1.1.1. RISC I Instruction Set
1.1.2. Register File
1.2. Berkeley RISC Implementations
2. BRISC Design
2.1. Pipeline Timing
2.2. BRISC Hardware Design
2.2.1. Data Path
2.2.1.1. PCs AND ADDRESS
2.2.1.2. CACHE
2.2.1.3. REGISTER FILE
2.2.1.4. ALU PATH
2.2.1.5. SHIFTER PATH
2.2.1.6. SPECIAL REGISTERS
2.2.1.7. RESULT LATCH
2.2.2. Control
2.2.2.1. Control Hardware
2.2.2.2. Control ‘Microcode’
2.2.2.3. Support Processor
3. Performance Analysis
3.1. BRISC Performance
3.1.1. CPU Speed
3.1.1.1. CPU Cycle Time
3.1.1.2. CPU Cycles Per Instruction
3.1.1.3. Design Changes
3.1.2. Memory Speed
3.2. Benchmark Comparisons
4. Cost
5. Closing Remarks
5.1. Comments on SCALD
5.1.1. SCALD Speed
5.1.2. Graphics Editor
5.1.3. Compiler
5.1.4. Timing Verifier
5.1.5. Simulator
5.1.6. Post Processor
5.2. Lessons Learned
5.3. Future Work
6. Conclusion
7. Acknowledgements
Appendix A. BRISC Instruction Set
Appendix B. Commercial Cache Summary
Appendix C. Effective Cycle Time Calculation
Appendix D. BRISC Parts List
Appendix E. BRISC Drawings
Appendix F. DAPL Microcode Listings



1.  Introduction 

Even though microprocessors are now available with the power of previous years’ minicomputers,
and are expected to get faster, there will always be problems where even the fastest
microprocessors are not fast enough. This paper describes the design of BRISC (Big Reduced
Instruction Set Computer), a high performance 32-bit processor designed with discrete 100K ECL
logic. BRISC uses the same concepts used to speed up the RISC I microprocessor; these concepts
include a simplified instruction set and overlapping register windows.

BRISC is designed with a Structured Computer-Aided Logic Design (SCALD) system.
Most of the CPU design has been entered into the SCALD system and post processed for physical
design errors. The worst case logic path has been timing verified using SCALD to approximate
the CPU cycle time. Some simulation has been done (about 70% of the design) to verify functional
correctness of the CPU design.

This paper describes the design, performance and cost of BRISC. Performance and cost are
compared to other high performance computers to evaluate the usefulness of the RISC architecture
for high performance computer design. The rest of this section describes the RISC I architecture
and the Berkeley RISC implementations. The RISC I architecture is described because BRISC
uses the RISC I architecture at the instruction set level.

1.1. RISC I Architecture

Fundamental to RISC architectures is the concept that frequent, time consuming tasks
should be done as fast as possible, and infrequent tasks may be done more slowly. RISC I is
designed to execute high level languages such as C and Pascal efficiently, so the RISC I instruction
set concentrates on the most time consuming operations of high level languages. Two time
consuming operations optimized by the RISC I instruction set are: (1) procedure calls and returns,
which account for 40% of the time spent in traditional architectures; and (2) data references to
local scalar variables and constants, which account for over 60% of data references. RISC I is
designed for integer programs; floating point operations are done in software and are therefore
significantly slower than integer operations.

The remainder of this section provides a brief introduction to the RISC I architecture. For
more information, the reader is referred to the RISC I Principles of Operation.

RISC I has two significant departures from traditional computer architectures. First is a
small and simple instruction set; second is a large register file consisting of overlapping register
windows.

1.1.1. RISC I Instruction Set

The RISC I instruction set consists of 31 instructions and supports five addressing modes.
Three addressing modes - register, immediate and indexed - are supported directly by the
instruction set. Two addressing modes - absolute and register indirect - may be synthesized from
indexed addressing. Only the LOAD and STORE instructions access the main memory; all other
instructions read and write registers. Most operations read two registers and write into a third.
Because of the register-to-register orientation of the instruction set, RISC processors complete an
instruction nearly every memory cycle. LOAD and STORE instructions are the only exceptions and
require one or two extra cycles.
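
As a rough illustration of how the two synthesized modes fall out of indexed addressing, the following Python sketch computes effective addresses. It assumes, as in the published RISC I design, that register 0 always reads as zero; the function names and the 32-entry register list are illustrative and do not come from this report.

```python
MASK32 = 0xFFFFFFFF

def indexed(regs, rs, offset):
    """Indexed addressing: effective address = R[rs] + offset."""
    return (regs[rs] + offset) & MASK32

def register_indirect(regs, rs):
    """Register indirect: indexed addressing with a zero offset."""
    return indexed(regs, rs, 0)

def absolute(regs, address):
    """Absolute: indexed addressing off register 0 (assumed to hold zero)."""
    return indexed(regs, 0, address)

regs = [0] * 32        # hypothetical register contents
regs[5] = 0x1000
print(hex(indexed(regs, 5, 8)))         # base register plus offset
print(hex(register_indirect(regs, 5)))  # same base, offset of zero
print(hex(absolute(regs, 0x2A)))        # offset used directly as the address
```

In a real encoding the absolute form can only reach addresses that fit in the instruction's immediate field; the sketch ignores that limit.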

RISC/E added 23 instructions to the RISC I instruction set to support relative loads and
stores, increased access to the special registers, and support for the virtual memory. BRISC uses
the RISC/E instruction set. The BRISC instruction set is included in Appendix A.

1.1.2. Register File

RISC I relies on multiple sets or windows of 32-bit registers to speed up calls, returns, and
access to local variables. This collection of register windows is known as the register file.

BRISC contains a register file consisting of 128 32-bit registers. Sixteen of these registers
(the register window) are available to a procedure at any one time, and a new set is allocated for
a CALL and deallocated for a RETURN. Memory accesses are not required to save and restore the
return address or other registers as required by conventional architectures. Furthermore, movement
of data is not required for parameter passing because allocation and deallocation of registers
for CALL and RETURN is done with overlapping windows. Overlapping register windows usually
allow the program’s complete activation stack to remain in registers, so that the CPU references
fast registers, rather than slow main memory.

BRISC has two sets of 128 registers; one set is active in user mode, the other is active in
system mode. RISC I registers spill into memory if the 128 register limit is exceeded. Ten of the
16 registers in a register window are shared by all windows in a register file and are known as the
global registers.

1.2. Berkeley RISC Implementations

Four RISC processors have been designed at UC Berkeley since the project was started in
1980. Two of them are single chip NMOS designs (RISCs I and II), and two are discrete logic
ECL designs (RISC/E and BRISC).

RISC  I 

RISC I is the first RISC designed at Berkeley. It was designed by five students in 6 months.
The 44,500 transistor chip was fabricated using 4 micron NMOS technology, and is 10.3 x 7.74
mm (406 x 305 mils). Through the use of computer aided design tools, RISC I worked on first
silicon. Even though the chips work, they do not meet their design goal for speed. The design goal
for RISC I was a 400 ns cycle; RISC I runs with a 2000 ns cycle. Even at 2000 ns RISC I runs C
programs faster than an 8 MHz 68000.

RISC II

RISC II was designed by two graduate students in two years and is a more refined processor
than RISC I. Building on the success of RISC I in using computer aided design tools to design
chips that were logically correct, RISC II used a newer set of tools that included a program to
verify the speed of the chip (Crystal). The longer design cycle and better tools resulted in a
better circuit design and working chips with 75% more registers (138), 25% less area (10.3 x 5.8
mm), and four times the speed (500 ns cycle time) of RISC I.

RISC II uses a three stage pipeline as opposed to the two stage pipeline of RISC I. A similar
three stage pipeline was subsequently used by RISC/E and BRISC. The three stage pipeline
is discussed in Section 2.1.

RISC/E 

RISC/Extended (RISC/E) was a paper design of a 10K ECL CPU and cache also started in
1980 to see if the RISC I concepts could be applied to a high performance discrete logic CPU.
The result is the design for a fast CPU with a 75 ns cycle time consisting of 450 10K ECL parts.
The cache also has a cycle time of 75 ns and consists of 550 10K ECL parts.




BRISC 

Big RISC (BRISC) continues the work started by RISC/E to verify the applicability of
RISC concepts to high performance computer design. Jeff Deutsch and the author designed the
first version of BRISC during the Winter quarter of 1983. BRISC started as a translation of the
10K ECL design of RISC/E to 100K ECL. Minor changes were made throughout the design to
accommodate differences between the 10K and 100K parts. The ALU, shifter, program counters,
and control logic were redesigned for increased speed and decreased parts count. Timing was kept
nearly identical to RISC/E.

The resulting design from the Winter quarter consisted of 606 parts. The high part count
was due primarily to the lack of buffers in the SCALD library to prevent loading violations;
instead parts were replicated to obtain the desired drive. Simulation was completed on the ALU,
shifter and program counters, and post processing was started. Timing verification was not
performed because of bugs in the timing verifier.

The author continued work on BRISC during the Spring quarter of 1983. BRISC was
redesigned to use buffers that were added to the SCALD library, and control was redesigned to
require fewer bits. The resulting design uses 332 parts. The BRISC equivalent of microcode was
written for the arithmetic instructions, and simulation was started for the entire design. The
timing verifier became available from Valid, so enough of the design was verified to calculate the
cycle time. Post processing was completed for the entire design.

BRISC is faster than RISC/E (~47 ns vs. 75 ns) with fewer chips (332 vs. 550). 100K ECL is
a faster logic family than 10K ECL, but was not widely available when the RISC/E project was
started (1980). The 100K ECL parts tend to have more functionality than 10K ECL parts, but
this functionality is bought at the expense of more pins per package. Nearly all 100K ECL parts
have 24 pins, whereas most 10K ECL parts have 16 pins.

The  remainder  of  this  paper  describes  the  design,  performance  and  cost  of  BRISC. 

2. BRISC Design

BRISC’s design is based on the three stage pipeline developed by Lloyd Dickman that was
used by RISC II and RISC/E. This section describes the pipeline timing and the hardware design
of BRISC.

2.1. Pipeline Timing

BRISC pipeline timing will only be summarized here; a more complete description may be
found in the RISC/E design study. BRISC uses a two phase, non-overlapping clock. Most
instructions take three cycles, or six phases, to execute. The CPU is pipelined so that three
instructions are being acted on simultaneously, allowing instructions to issue and complete at a
rate approaching one per cycle. Figure 1 describes the six phases of a typical instruction, and
Figure 2 shows three instructions passing through the pipeline. The pipeline is designed to make
maximum use of key resources - the cache, register file, and ALU - and to minimize contention
for these resources. The pipeline is also designed to not require pipeline flushes; all instructions
entering the pipeline go through to completion unless externally interrupted.

Most instructions follow the pipeline timing shown in Figures 1 and 2 so that no complex
pipeline interlocks are required. Even instructions that receive special treatment in traditional
pipelines - such as CALL, JUMP, and RETURN - do not affect pipeline timing in BRISC and follow
the same basic three cycle timing.

Only the LOAD and STORE instructions require more than three cycles to execute. One extra
cycle is required for the memory access; a second extra cycle is required for 8- and 16-bit loads
and stores to align the memory data with the register file using the shifter. The second extra
cycle is not required for LOAD WORD and STORE WORD because 32-bit words are always aligned
with memory.
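
The timing rules above lend themselves to a back-of-the-envelope cycle count. The Python sketch below is an idealized model, not the report's performance analysis: it assumes no cache misses, assumes each extra LOAD or STORE cycle delays the following instructions by one cycle, and uses instruction category names invented for illustration. The 46 ns figure is the cycle time quoted in the abstract.

```python
CYCLE_NS = 46        # CPU cycle time quoted in the abstract
PIPELINE_DEPTH = 3   # fetch, execute, write

# Extra cycles beyond the basic three-cycle timing, per the text above.
# The category names are illustrative, not BRISC mnemonics.
EXTRA_CYCLES = {
    "alu": 0,         # register-to-register operations, CALL, JUMP, RETURN
    "load_word": 1,   # one extra cycle for the memory access
    "store_word": 1,
    "load_half": 2,   # second extra cycle to align 8- and 16-bit data
    "load_byte": 2,
    "store_half": 2,
    "store_byte": 2,
}

def total_cycles(stream):
    """Cycles to run a stream through an otherwise ideal three-stage pipeline."""
    if not stream:
        return 0
    fill = PIPELINE_DEPTH - 1   # cycles before the first instruction completes
    return fill + len(stream) + sum(EXTRA_CYCLES[kind] for kind in stream)

def peak_mips(cycle_ns=CYCLE_NS):
    """Upper bound on instruction rate: one instruction per cycle."""
    return 1000.0 / cycle_ns

stream = ["alu", "load_word", "alu", "load_byte", "alu"]
cycles = total_cycles(stream)
print(cycles, "cycles,", cycles * CYCLE_NS, "ns")
print(round(peak_mips(), 1), "MIPS peak")
```

The peak rate from this model is an upper bound; the abstract's "greater than 15 MIPS" reflects the extra LOAD and STORE cycles and memory effects that the report treats later.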

2.2. BRISC Hardware Design

Figure 3 is the top level SCALD drawing for BRISC (Appendix E contains the complete set
of SCALD drawings for BRISC). The BRISC CPU contains the ten modules shown in Figure 3.
The control logic is in the CONTROL module and the other nine modules constitute the data
path. Design of the BRISC data path and control is described in the following sections.

2.2.1. Data Path

Most data flows from left to right in Figure 3, starting with the instruction address being
generated in PCs AND ADDRESS and ending with the result in the RESULT LATCH for
subsequent writing back to the REGISTER FILE. The following paragraphs describe each of the
modules in the data path.

2.2.1.1. PCs AND ADDRESS

The top half of the PCs AND ADDRESS drawing in Appendix E contains the program
counters (PCs). BRISC has three PCs: the Next PC (PCN), Current PC (PCC), and Last PC
(PCL). The PCN is incremented at the start of the instruction cycle, used to fetch the next
instruction, and, except for jumps, becomes the PCC at the end of the cycle. The
PCC holds the value of the program counter for the currently executing instruction and the PCL
holds the ‘last’ value of the PC, useful for interrupts and exceptions.


[Figure 1: timing diagram of the six phases of a typical instruction, grouped into Cycle 1
(Fetch), Cycle 2 (Execute), and Cycle 3 (Write), with time running left to right.]

FIGURE 1. BASIC TIMING CYCLE. The three cycles of an instruction, and correspondingly three
stages of the execution pipe, are called Fetch, Execute, and Write. At the start of phase 1, the Next
Program Counter (PCN) is gated onto the cache address bus and by the end of phase 2 the
instruction has been decoded. In phase 3, the two operand registers are read from the register
file and the Arithmetic Logic Unit (ALU) and shifter operations are selected. By the end of
phase 4, the ALU and shift operations are complete. In phase 5, the result of the data-path
operation is buffered and the sign optionally extended. Finally, in phase 6 the result is written
back into the register file.

[Figure 2: three staggered Fetch/Execute/Write timelines, one per instruction.]

FIGURE 2. STANDARD PIPELINE TIMING. The following instruction stream is shown traversing the
pipeline:

SUB D,S,T
ADD L,M,N
SLL P,Q,R

The x-axis is time. A vertical line drawn through the three timelines should show up to three
simultaneous operations occurring in the CPU.


The bottom half of the PCs AND ADDRESS drawing shows the memory address generation.
There are two sources for the memory address sent to the cache: one is the PCN for instruction
fetches, the other is the ALU for loads and stores. The Cache Address Multiplexor selects the
correct address source and sends it to the CACHE module.

2.2.1.2. CACHE

A cache has not been designed for BRISC. The ‘cache’ in the drawings is a 256 by 32-bit
memory used only to enable SCALD simulation of the BRISC design. The actual cache would be
on a separate board and would be similar to the cache designed for RISC/E. The RISC/E cache
board contains a cache, Translation Buffer, and Page Usage Buffer.

RISC/E  Cache 

The RISC/E cache is a 3-way set associative data and instruction cache with 2048 32-bit
words. The line size may be set by software and is variable from 4 to 32 words. The number of
sets is variable from 64 to 1024. A write-back policy and least recently used (LRU) block
replacement algorithm are used.


RISC/E  Translation  Buffer  (TLB) 

The purpose of the TLB is to store the most recent virtual to physical memory address
translations. Accesses to virtual addresses not in the TLB are trapped and translated by
software. The TLB is a 3-way set associative buffer that stores 1024 address translations and
associated memory protection and replacement information.


RISC/E  Page  Usage  Buffer  (PUB) 

The  PUB  contains  reference  and  modification  history  for  each  physical  page  frame.  This 
information  is  updated  by  hardware  and  used  by  the  page  replacement  software  to  identify  pages 
to  be  replaced  when  more  physical  page  frames  are  required. 


2.2.1.3. REGISTER FILE

The REGISTER FILE module contains two sets of 128 32-bit registers used for the system
and user register files. Two independent read requests, or a single write request, may be serviced
in a single clock phase. The dual read is implemented by using two redundant copies of the
registers. Both copies are written simultaneously, but are read independently to retrieve two
operands at a time. The register file memory is implemented with eight 256x4-bit Fujitsu ECL RAMs.
The RAMs have a 7 ns access time.




The REGISTER FILE module supports mapping the 128 registers into the virtual address
space so that executing processes may access any of the 128 registers by referencing the
appropriate virtual address.

2.2.1.4. ALU PATH

The ALU PATH module contains a high-speed ALU and carry lookahead unit, input
multiplexors and input/output latches. Each of the two inputs of the ALU can accept data from two
independent sources. The sources may be either of the two source registers specified in the
instruction, immediate data contained in the instruction, the Current PC for relative addressing,
or the result of the previous instruction. ALU PATH output is routed to the RESULT LATCH
for writing to the REGISTER FILE and also to the Cache Address Multiplexor in PCs AND
ADDRESS to be used as the memory address for branches, loads and stores. All ALU operations
execute in a single clock phase.


2.2.1.5. SHIFTER PATH

The  SHIFTER  PATH  module  performs  1-  to  32-bit  left  or  right  shifts  with  optional  sign 
extension.  Input  multiplexors  and  latches  select  from  the  same  set  of  inputs  used  by  the  ALU 
PATH.  The  output  is  latched  and  routed  to  the  RESULT  LATCH  for  writing  to  the  REGISTER 

FILE. 

The basic shift part provided in 100K ECL can only do a logical shift right or a rotate. The
BRISC instruction set also requires arithmetic shifts and left shifts. Arithmetic shifting is
provided by extending the sign bit of the 32-bit operand to an additional 32 bits and making it the
upper 32 bits of the 64-bit input to the shifter. Left shifting is provided by having a path into
the shifter that reverses the input bits and then reverses the result bits on output.
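
The two tricks in the preceding paragraph can be checked with a small software model. The Python sketch below is only an illustration: it assumes the basic shift part behaves like a logical right shift over a 64-bit input that delivers the low 32 bits, and all function names are invented here rather than taken from the BRISC drawings.

```python
MASK32 = 0xFFFFFFFF

def basic_shift_right(input64, amount):
    """Model of the basic part: logical right shift of a 64-bit input,
    keeping the low 32 bits of the result."""
    return (input64 >> amount) & MASK32

def reverse_bits32(x):
    """Reverse the bit order of a 32-bit value (bit 0 <-> bit 31, and so on)."""
    return int(format(x & MASK32, "032b")[::-1], 2)

def srl(x, n):
    """Logical shift right: upper 32 bits of the shifter input are zero."""
    return basic_shift_right(x & MASK32, n)

def sra(x, n):
    """Arithmetic shift right: the sign bit is replicated into the upper 32 bits."""
    x &= MASK32
    upper = MASK32 if x & 0x80000000 else 0
    return basic_shift_right((upper << 32) | x, n)

def sll(x, n):
    """Left shift: reverse the input bits, shift right, reverse the result."""
    return reverse_bits32(basic_shift_right(reverse_bits32(x), n))

print(hex(sll(0x00000001, 4)))   # 0x10
print(hex(sra(0x80000000, 4)))   # 0xf8000000
print(hex(srl(0x80000000, 4)))   # 0x8000000
```

The bit-reversal path costs no shifter hardware of its own, since reversing bit order is only a matter of wiring on the input and output multiplexors.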

2.2.1.6. SPECIAL REGISTERS

The SPECIAL REGISTERS module contains the Program Status Word (PSW), Current
Window Pointer (CWP), and Saved Window Pointer (SWP). The PSW holds the ALU condition
codes and a byte of CPU status flags. The CWP contains the counters and registers used to point
into the currently active position in the system and user register files. The SWP points to the
last words of the system and user register files that are in memory and provides detection of
register window overflow and underflow for CALL and RETURN.




2.2.1.7. RESULT LATCH

The RESULT LATCH module is composed of a 4-way multiplexer and a 32-bit latch. It
serves to buffer the result of a computation at the end of each cycle. The four inputs to the
RESULT LATCH are the ALU PATH output, the SHIFTER PATH output, one REGISTER
FILE output and the Current PC from PCs AND ADDRESS. The output is routed to the
REGISTER FILE for register writes and to the result bus (RBUS) for register forwarding. Register
forwarding is required when an instruction requires the result of its preceding instruction.
Because of the nature of the pipeline, the result will not have been written back to the register
file, so it is routed directly to the input multiplexors.
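
Register forwarding can be illustrated with a short behavioural model. The Python sketch below is a simplification under stated assumptions: it collapses the six-phase timing into "operands are read before the previous result is written back", keeps a single pending result, and uses class and function names that are hypothetical rather than taken from the BRISC design.

```python
import operator

MASK32 = 0xFFFFFFFF

class ResultLatch:
    """Holds the most recent result until it is written to the register file."""
    def __init__(self):
        self.dest = None   # destination register number, or None if empty
        self.value = 0

def read_operand(reg_file, latch, src):
    """Operand select: take the forwarded value when the source register is
    the destination of the preceding instruction, else read the register file."""
    if latch.dest is not None and latch.dest == src:
        return latch.value        # forwarded over the result bus
    return reg_file[src]

def run(program, reg_file):
    """Execute (op, dest, src1, src2) tuples with write-back delayed by one instruction."""
    latch = ResultLatch()
    for op, dest, a, b in program:
        x = read_operand(reg_file, latch, a)
        y = read_operand(reg_file, latch, b)
        if latch.dest is not None:            # previous result written back only now
            reg_file[latch.dest] = latch.value
        latch.dest, latch.value = dest, op(x, y) & MASK32
    if latch.dest is not None:                # drain the final result
        reg_file[latch.dest] = latch.value
    return reg_file

regs = [0] * 16
regs[1], regs[2] = 5, 7
program = [
    (operator.add, 3, 1, 2),   # r3 = r1 + r2
    (operator.sub, 4, 3, 1),   # r4 = r3 - r1, needs r3 before it is written back
]
run(program, regs)
print(regs[3], regs[4])        # 12 7
```

Without the forwarding check in `read_operand`, the second instruction would read the stale contents of r3 from the register file.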

2.2.2. Control

RISC processors achieve high performance by optimizing the data path, not by expanding
control. The simple instruction set results in simpler control than is found in conventional
processors. Control consists of pipelined latches driven by control RAMs. The RAMs are used to
decode instructions and control the data path. Control is described in terms of hardware,
contents of the control RAMs, and the support processor used to program the control RAMs.


2.2.2.1. Control Hardware

The CONTROL module consists of a pipeline of three latches that correspond to phases 3,
4 and 5 of instruction execution. Instructions are stepped through the pipe and decoded during
phases 3 and 4. Movement of control data through the control pipeline corresponds to the
movement of data through the execution pipeline in the data path.

Decoding of instructions is done with RAMs in the PHASE 3 DECODE RAM and PHASE 4
DECODE RAM modules. Input to the RAMs consists of the opcode and immediate fields of the
executing instruction; the outputs from the RAMs are the control signals for the data path.

The pipeline is controlled by another set of latches in the PIPELINE CONTROL module.
Input to PIPELINE CONTROL consists of the Pipe Control bits generated by the PHASE 3
DECODE RAM. These bits control insertion of one or two extra cycles for LOAD and STORE
instructions.

2.2.2.2. Control ‘Microcode’

Stone defines microprogramming as 'the use of storage to implement the control unit.' By
this definition, the contents of the decode RAMs in CONTROL are the BRISC equivalent of
microcode. BRISC microcode, however, is much simpler than traditional microcode. Traditionally,
a variable number of microinstructions are executed per machine instruction. Sequencing of
microinstructions is usually controlled by bits in the microinstructions and a microinstruction
counter. BRISC always fetches exactly two microinstructions per RISC instruction; thus BRISC
does not require a microinstruction counter or complex sequencing control. Sequencing of the
microcode is controlled only by the instruction stream; each opcode field is decoded to the 54-bit
microcode word during Phases 3 and 4. Only two bits affect timing in any way: the Pipe
Control bits described above.
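
To make the contrast with conventional microcode concrete, the following Python sketch models decoding as two table lookups indexed by the opcode, with no microinstruction counter. Everything specific in it is an assumption made for illustration: the opcode values, the control bit patterns, the split of the microcode word between the two RAMs, and the encoding of the Pipe Control bits as a count of extra cycles do not come from this report.

```python
# Hypothetical decode-RAM contents, indexed by opcode.
# Opcodes, bit patterns, and field widths are invented for illustration.
PHASE3_DECODE_RAM = {
    0x0C: {"control": 0b0001, "pipe_control": 0},   # an ALU-style instruction
    0x26: {"control": 0b0010, "pipe_control": 1},   # a word load
    0x28: {"control": 0b0011, "pipe_control": 2},   # a byte load
}
PHASE4_DECODE_RAM = {
    0x0C: 0b1000,
    0x26: 0b0100,
    0x28: 0b0110,
}

def decode(opcode):
    """Return (phase 3 control, phase 4 control, extra cycles) for one instruction.

    Exactly two lookups are made per instruction, one per decode RAM, so no
    microinstruction counter or sequencing logic is modelled.
    """
    first = PHASE3_DECODE_RAM[opcode]    # first microinstruction, read in phase 3
    second = PHASE4_DECODE_RAM[opcode]   # second microinstruction, read in phase 4
    return first["control"], second, first["pipe_control"]

for opcode in (0x0C, 0x26, 0x28):
    phase3, phase4, extra = decode(opcode)
    print(f"opcode {opcode:#04x}: phase3={phase3:04b} phase4={phase4:04b} extra cycles={extra}")
```

Because the RAM contents are plain data, changing the instruction decoding in this model means rewriting table entries rather than control logic, which is the role the support processor plays when it loads the control RAMs.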

A microcode assembler called DAPL is used to create microcode for BRISC. Only a subset
of the full capability of DAPL is required because of the simplicity of the BRISC microcode.
DAPL is used primarily to allow for symbolic names instead of absolute bit patterns in order to
identify the field names and field contents of the microcode. Symbolic field names and contents
simplify shuffling bits as the microcode is finalized.

The microcode for BRISC has not been completed. Enough microcode has been written to
allow simulation of the BRISC arithmetic instructions. A listing of the microcode as DAPL input
is included in Appendix F.

2.2.2.3. Support Processor

The support processor initializes contents of the control RAMs. The support processor may
also be used to debug the BRISC hardware. Serial inputs and outputs of the edge-triggered shift
registers used as latches in BRISC may be tied together to allow scan-in scan-out of processor
state by the support processor. Some of the flow-through latches currently used by BRISC may
be converted to edge-triggered shift register latches to provide more of the processor state to the
support processor. Additional hardware may be added to allow single stepping the BRISC CPU
with the support processor.

The interface to the support processor has not been designed, nor has a support processor
been selected. It is envisioned that the interface to the support processor will have minimal
impact on the BRISC design, and that any commercially available micro or minicomputer may be
used for the support processor.

3. Performance Analysis

BRISC  performance  is  estimated  in  terms  of  benchmark  performance,  and  is  compared  to 
the  benchmarks  run  on  other  high  performance  computers. 




3.1. BRISC Performance

BRISC performance is determined by the speed of the CPU, memory, and Input/Output
(I/O) subsystems, and also by the degree of overlap of each. Degradation of performance by the
I/O subsystem is assumed to be minimal for the benchmarks used, and is not considered. The
benchmarks do not include system overhead because task time is measured instead of elapsed
time; therefore, system overhead is not considered.

As  a  result  of  ignoring  I/O  and  system  overhead,  only  CPU  and  memory  subsystem  speed  is 
used  to  determine  BRISC  performance.  CPU  speed  is  first  considered  to  determine  the  CPU  cycle 
time;  then  this  time  is  degraded  by  the  memory  subsystem  overhead  to  calculate  an  estimate  of 
BRISC  performance. 

BRISC performance calculation can only be an approximation because of the number of
variables that affect performance. As a result, three BRISC performance figures are calculated
using three sets of assumptions for the performance variables. The three performance figures
provide a simple sensitivity analysis of the assumptions made, and are identified as BRISC A, BRISC
B and BRISC C. BRISC A is a fictitious best case figure, unattainable but a reasonable upper
bound on performance. BRISC B is a middle of the road figure, with conservative estimates used
to calculate a plausible estimate of BRISC performance. BRISC C uses worst case figures to give
a lower bound on performance.

3.1.1. CPU Speed

CPU speed is determined by the basic cycle time of the CPU and the number of cycles per
instruction. These numbers are presented, and suggestions are made for possibly improving the
BRISC CPU speed.

3.1.1.1. CPU Cycle Time

The SCALD system timing verifier was used to determine the BRISC CPU cycle time. The
SCALD timing verifier uses worst case minimum and maximum delay times through all parts of
the design to find timing errors. Output of the timing verifier may be used to calculate worst case
logic delays. The timing verifier identified the register file read as the worst case path through
the CPU. The register file is in the critical path because of the time required to do the register
address calculation and the time required to read the register file RAM (7 ns access time). The
minimum time for a register file read including address calculation and RAM access time is 18.5
ns. Assuming equal length phases, the BRISC minimum cycle time is therefore 37 ns.

A 37 ns cycle time assumes worst case logic delays, but no wire delays. Wire delays account
for about half the cycle time of many high performance CPUs, so the 37 ns cycle time for BRISC
needs to be corrected for wire delays. The SCALD system does not include a physical design
system, so accurate values for the wire delays in BRISC are not available.

SCALD does provide a means to input an estimate of wire delay. Wire delay input is
specified as a minimum and maximum delay to be used on all wires. The minimum and
maximum numbers are used to find worst case delays as determined by worst case wire delays and
worst case logic delays. Figure 4 shows the effect of different values of worst case wire delays on
the BRISC cycle time as reported by the SCALD timing verifier.


Figure  4.  BRISC  CPU  Cycle  Time 


The BRISC CPU has been designed to fit on a single board to keep wire lengths short and
to avoid the delay penalty of connectors to other boards. It is assumed that chips will be located
on the CPU board such that wire delay is minimized through the worst case path. For example,
the register file address generation logic will be located in a small area to minimize worst case
wire delays.

A best case wire delay of 0.0 ns is used for BRISC A, a median wire delay of 0.5-1.0 ns is
used for BRISC B, and a worst case wire delay of 0.5-2.0 ns is used for BRISC C. The resulting
CPU cycle times are 37 ns for BRISC A, 46 ns for BRISC B, and 63 ns for BRISC C. A median
wire delay of 0.5-1.0 ns is considered reasonable because most of the parts in the critical path can
be kept together. Adjacent ICs have a 0.5 ns delay between them, so most wires in the critical
path should be close to a 0.5 ns delay. A few wires may be longer than 2.0 ns, but the average in
the worst case path is estimated to be 1.0 ns.


3.1.1.2. CPU Cycles Per Instruction

As stated in Section 3.1, BRISC completes an instruction every cycle except for the LOAD
and STORE instructions, which add an extra one or two cycles. RISC I requires one extra cycle for
LOAD and STORE, and this extra cycle was considered in the original RISC I benchmarks. These
benchmarks will be used to calculate BRISC performance, so the effect of a single extra cycle for
loads and stores will be included. RISC I did not require a third cycle for unaligned loads and
stores as required by BRISC. The benchmarks considered, however, only use 32-bit data, so all
loads and stores in the benchmarks require only two cycles, and the third cycle may be safely
ignored.

The effect of unaligned loads and stores on programs other than the benchmarks should be
minimal. Loads and stores can be conservatively estimated to occur in 20% of RISC I
instructions. Even if half of the loads and stores are unaligned, the total effect on BRISC performance is
an 8% degradation in throughput. This performance degradation may be avoided by using only
32-bit data.
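The 8% figure can be checked with a short calculation. The sketch below assumes a baseline of one cycle per instruction plus one extra cycle for every load or store, with unaligned accesses adding one further cycle; the function name and this cycle model are illustrative assumptions rather than the report's own derivation.

```python
def unaligned_degradation(load_store_fraction, unaligned_fraction):
    """Fractional loss in throughput caused by a third cycle on unaligned accesses.

    Baseline: every instruction takes one cycle, and each load or store
    takes one extra cycle. Unaligned loads and stores take one more.
    """
    base_cpi = 1.0 + load_store_fraction
    unaligned_cpi = base_cpi + load_store_fraction * unaligned_fraction
    return 1.0 - base_cpi / unaligned_cpi

loss = unaligned_degradation(0.20, 0.5)
print(f"{loss:.1%}")  # prints 7.7%
```

With 20% loads and stores, half of them unaligned, the average rises from 1.2 to 1.3 cycles per instruction, which is a throughput loss of about 8%.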

Design  Changes 

An advantage of using SCALD for designing a computer is that experiments may be tried
with slightly different architectures to find the overall impact on performance and cost of
alternative architectures. An architectural feature is justified only if the benefits of the feature outweigh
the cost. For example, if a feature adds 3 ns per phase (6 ns per cycle) to the critical path
because of a single logic delay and a wire delay, then that feature increases the 47 ns cycle time
of BRISC B by 13%. The feature is therefore not worthwhile unless it makes up for the 13%
increase in CPU cycle time.
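The cost of a feature in cycle time can be expressed as a small calculation. This is a minimal sketch assuming two equal phases per cycle, as in the example above; the function name and parameters are illustrative.

```python
def cycle_time_change(base_cycle_ns, delta_per_phase_ns, phases_per_cycle=2):
    """Fractional change in cycle time when each phase grows or shrinks by delta."""
    return (delta_per_phase_ns * phases_per_cycle) / base_cycle_ns

penalty = cycle_time_change(47.0, 3.0)
print(f"{penalty:.0%}")  # prints 13%
```

The same function applies to proposed savings: a reduction of 5 ns per phase against the same 47 ns cycle gives `cycle_time_change(47.0, 5.0)`, or about 21%.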

Even if a feature does not add gate delays to the critical path, if additional chips are
required, the room required by the additional chips may add wire delays to the critical path, and
the additional chips add to the total cost. Performance and cost impact is much more severe if
an additional board is required.

A  few  architectural  experiments  have  been  run  on  the  BRISC  design.  The  ALU  PATH, 
SHIFTER  PATH,  and  CONTROL  modules  were  redesigned  several  times  to  decrease  parts  count 
and  increase  speed.  This  section  describes  some  further  experiments  that  should  be  run  to  find 
the  cost  of  specific  architectural  features.  The  changes  are  grouped  under  design  deletions,  design 
enhancements,  and  other  changes. 




Design  Deletions 

SEPARATE SYSTEM AND USER REGISTER FILES: The RISC I architecture does not
include separate system and user register files. The separate files were added by RISC/E because
the single 128-word register file required by the RISC I architecture occupied only half of the
256-word RAM. The unused half of the RAM was allocated to a second register file. Separate
system and user register files do not directly affect BRISC CPU performance, but they do add
complexity and parts to the BRISC CPU design and as a result may impact overall CPU
performance. Complexity is added because ten registers of each register file are devoted to the global
registers, which are shared by all register windows. In order to avoid addressing the global
registers in the register file RAMs, additional RAMs are used (the ADDSUB module) to increment the
Current Window Pointer (CWP). If the system registers were deleted, the global registers could
be moved to the unused half of the register file RAM, counters could be used instead of the
latches currently used in the CWP, and ADDSUB deleted. The system CWP and SWP could
also be deleted, for a total chip savings of over 9 chips. CPU performance may be improved
because of the smaller number of parts, but system performance may be degraded because of
slower switching between system and user modes (but only if system mode required a separate
stack from user mode).

REGISTER FILE: Since the register file is in the critical path, an interesting experiment
would be to delete the register file entirely and replace it with a single set of registers as used in
conventional architectures. Then register address calculation would be greatly simplified because
the CWP would no longer be required and the register address would be determined only by the register
number bits in the instruction. Deleting the register file would save at least two gate delays and
associated wire delays for a total savings of at least 5 ns per phase (10 ns per cycle). A savings of
10 ns translates to a 21% performance increase for BRISC B. More than 30 chips would also be
deleted from the design. The resulting CPU would be significantly faster and smaller, but calls,
returns, and accesses to local variables would take many more instructions than required by the
RISC I instruction set with the register file. More analysis needs to be performed to find the true
cost of the register file.

REVERSE SUBTRACT: The REVERSE SUBTRACT instruction allows the two source
operands of an instruction to be subtracted in the reverse order from the SUBTRACT instruction
and is of questionable utility. REVERSE SUBTRACT has not been deleted because it does not seem
to have any impact on the BRISC CPU performance or cost.




Design  Additions 

MULTIPLY: As will be shown in the benchmark section, BRISC performs well on
benchmarks with multiplies, even without a MULTIPLY instruction. But BRISC could be made even
faster if MULTIPLY could be added with little impact to the rest of the CPU. This could be
accomplished by adding a single chip multiplier in parallel with the ALU. Currently available single
chip multipliers are significantly slower than the BRISC CPU cycle time, but are getting faster.
As faster multipliers become available, a multiplier will be a useful addition to the BRISC CPU.

FLOATING POINT: BRISC is not well suited for problems with heavy floating point
emphasis because of the lack of floating point hardware in the BRISC CPU. Floating point
performance of BRISC could be enhanced by adding hardware to do the time consuming tasks of
floating point operations. This is in keeping with the RISC philosophy of adding hardware only
for time consuming tasks. A possible addition for floating point is the multiplier mentioned in the
previous paragraph. Another addition would be a leading zeroes counter to be used with the
shifter for normalization. Ideally, changes for floating point should not complicate control;
instead, they should only enhance the data path for floating point operations. More research is
required to identify the best way to add floating point to a RISC processor.

Redesign 

MULTIPLEXORS VS. OPEN-EMITTER BUSSES: BRISC uses two methods for selecting
one of several sources into an input. One is a multiplexor on the input; the other is an OR-gate
for each of the sources with outputs tied together on an open-emitter bus. Examples of
multiplexor input can be found in the ALU PATH and SHIFTER PATH inputs. An example of an
open-emitter bus is the result bus (RBUS), which is driven by the Last PC, Current PC and the
RESULT LATCH. Open-emitter busses have less logic delay, but can have longer wire delays
because of the length of the busses and transmission line effects along the busses. Open-emitter
busses should only be used if the sources are close together, or if there are too many destinations
to add multiplexors on every input. More analysis may show that BRISC should use a different
mix of multiplexors and open-emitter busses to increase performance or decrease cost.

SHARED MULTIPLEXORS: Bussing may also be improved by combining multiplexors.
For example, analysis may show that the SPECIAL REGISTER input multiplexor may be
combined with the ALU PATH input multiplexor for a savings of six parts. Sharing multiplexors
may have an adverse effect on CPU performance because of longer wires caused by the sharing of
a single multiplexor between two logical portions of the CPU.

REGISTER ADDRESS CALCULATION: As stated previously, register address calculation
is a major contributor to the critical path of the BRISC CPU. Register address calculation should
be redesigned to save gate and wire delays in the critical path. One or two gate delays may be
saved by reallocating some of the random logic used for register address calculation. The
expected performance increase is between 2 and 5 ns per phase (4 to 10 ns per cycle), for a
performance gain of between 9 and 21% for BRISC B.

Memory  Speed 

Section 3.1.1 derived the CPU cycle time for BRISC and suggested methods of improving
the cycle time. CPU cycle time was derived for BRISCs A, B, and C by using three sets of
assumptions for wire delay. The purpose of this section is to derive numbers for the cycle times
of BRISCs A, B, and C weighted by the delays caused by the memory subsystem. These numbers
are used to calculate the speed of BRISCs A, B, and C relative to a 400 ns RISC II processor.
Performance of a 400 ns RISC II processor is not degraded by the memory subsystem because
address translation and main memory access occur within a single 400 ns cycle, so no cache or
TLB is required.

Memory subsystem performance is determined by many factors including cache
performance, virtual memory performance, and access time of main memory. These factors interact
with each other and with the CPU performance.

This section assumes that the overhead incurred by the virtual memory for BRISC is
minimal. Virtual memory overhead includes virtual address translation time and page fault time.
Virtual address translation time can be kept low with the use of a Translation Buffer (TLB). For
example, the observed hit ratio for the Amdahl 470V/6 TLB is about 99.6 to 99.7%. The
penalty incurred by page faults can be kept low by using a large physical memory to keep the
page fault rate low, and by switching processes during disk accesses caused by page faults to
minimize their effect.

If virtual memory overhead is assumed to be low, then memory subsystem performance is
determined by the cache performance and main memory access time. Factors affecting cache
performance include cache size, line size, set size, cache bandwidth, main memory bandwidth, cache
fetch algorithm, placement algorithm, replacement algorithm, write-through vs. write-back, cache
priorities, and prefetch. Smith provides an excellent discussion of these factors. Appendix B
shows the variability of these factors for some typical commercial computers as presented by
Smith, Pier, and Clark.

For purposes of determining the memory subsystem performance, the cache performance
factors mentioned in the previous paragraph may be summarized by three performance numbers:
the access time of the cache, the cache hit ratio, and the memory waiting time due to cache
misses. Access time of the cache is the time measured from the start of a memory request by the
CPU to the time the request is fulfilled by the cache when there is a cache hit. Cache hit ratio is
the percentage of CPU memory references that are found in the cache. Memory waiting time is
the average time that the CPU must wait for a cache miss.

The access time of the cache should be designed to be at least as fast as the cycle time of
the CPU, otherwise the CPU will never run at full speed. The 10K ECL RISC/E cache is the
same speed as the 10K ECL RISC/E CPU; therefore, it is assumed that a 100K ECL cache may
be built for BRISC with the same cycle time as the 100K BRISC CPU.

Best  case  performance  for  the  memory  subsystem  is  a  hit  rate  of  100%  so  that  the  CPU 
always  runs  at  full  speed;  BRISC  A  assumes  a  100%  hit  rate  as  the  upper  bound  on  performance. 

Figure 5 plots the effect of cache hit ratio and memory waiting time on BRISC B and
BRISC C performance (Appendix C includes the data used to generate Figure 5). Three curves
are shown for each processor, one each for 400, 800, and 1200 ns memory wait times. BRISC A
with no memory wait time is also included for reference.

Smith presents analytic results of cache hit ratios of over 99% for typical caches. He also
presents empirical results of cache hit ratios of over 98% for user state programs on the Amdahl
470 with a 16K cache. Simulations performed by DEC for the PDP-11/70 cache showed a hit
ratio of over 98% for a 2K cache with a line size of four words. Based on these results, BRISC B
will assume a cache hit ratio of 98% as a reasonable estimate of cache performance.

Both the analytic and empirical curves presented by Smith seldom show less than a 95% hit
ratio. DEC's simulations also show better than a 95% hit ratio for a 2K cache with a line size of
one. As a worst case estimate, BRISC C will use a 95% hit ratio.

Memory waiting time may be approximated by the cycle time of main memory, as long as
the line size equals the size of the path to main memory. Memory waiting time may be made less
than the cycle time of main memory by using write buffers, but some of this gain may be lost
because of interference from pre-fetches. Smith reports typical main memory access times of 300
to 600 ns for the Amdahl 470V/7 and IBM 3033. These numbers are similar to the main memory
access times of smaller computers such as the Sun Workstation†, which has a main memory access
time of 400 ns including virtual address translation time. Clark reports a main memory access
time of 1200 to 1400 ns for the VAX-11/780. BRISC A is not affected by main memory access
time since it has a 100% hit ratio; BRISC B assumes a 400 ns main memory access time and
BRISC C assumes a 1200 ns main memory access time.
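The way these assumptions combine into an effective cycle time can be sketched as a weighted average. The weighting below (hits at the CPU cycle time, misses at the memory waiting time) is an assumption; it reproduces the effective cycle times of 37, 53, and 120 ns listed in Table 1 from the total CPU cycle times of 37, 46, and 63 ns, but the data actually used for Figure 5 is in Appendix C. The function and variable names are illustrative.

```python
def effective_cycle_ns(cpu_cycle_ns, hit_ratio, memory_wait_ns):
    """Average cycle time: hits run at CPU speed, misses wait on main memory."""
    return hit_ratio * cpu_cycle_ns + (1.0 - hit_ratio) * memory_wait_ns

# (total CPU cycle time in ns, cache hit ratio, memory waiting time in ns)
cases = {
    "BRISC A": (37.0, 1.00, 0.0),
    "BRISC B": (46.0, 0.98, 400.0),
    "BRISC C": (63.0, 0.95, 1200.0),
}
effective = {name: round(effective_cycle_ns(*args)) for name, args in cases.items()}
print(effective)  # prints {'BRISC A': 37, 'BRISC B': 53, 'BRISC C': 120}
```

The sketch makes the sensitivity visible: for BRISC C the 5% of references that miss contribute about half of the 120 ns effective cycle.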


† Sun Workstation is a registered trademark of Sun Microsystems, Inc.


Figure 5. BRISC Effective Cycle Time




3.2. Benchmark Comparisons

Table 1 summarizes the performance of BRISCs A, B, and C based on the results of Section
3.1. The effective cycle times of 37, 53 and 120 ns were divided by the 400 ns cycle time of RISC II to
determine the ratio of BRISC cycle time to RISC II cycle time, which is shown in the bottom line
of Table 1. This ratio is used to calculate BRISC performance in the following benchmarks.


Table 1. BRISC A, B and C Performance Summary

FACTOR                            BRISC A   BRISC B   BRISC C
Logic Cycle Time (ns)             37        37        37
Clock Skew (± ns)                 0.0       0.1       0.1
Minimum Wire Delay (ns)           0.0       0.5       0.5
Maximum Wire Delay (ns)           0.0       1.0       2.0
TOTAL CPU CYCLE TIME (ns)         37        46        63
Cache Hit Ratio                   100%      98%       95%
Memory Waiting Time (ns)          0         400       1200
EFFECTIVE CPU CYCLE TIME (ns)     37        53        120
BRISC/RISC II Cycle Time Ratio    .09       .13       .30

The original RISC II benchmarks were run with a RISC simulator. The RISC simulator
takes into account the extra cycle required for loads and stores, and also takes into account the
effect of register window spills into main memory. Three benchmarks are presented: the Puzzle
program, the TAK LISP benchmark, and the UNIX Portable C Compiler. Three benchmarks are used to
measure BRISC performance for different applications. Each benchmark presents the time in
seconds to execute the benchmark. Benchmark performance is also presented as a multiple of
VAX-11/780 performance, shown as the ratio of the number of seconds required by the VAX
divided by the number of seconds required by the computer being measured. VAX performance is
used because the VAX has become a pseudostandard for comparison, and because a BRISC
computer could be built for about the cost of a VAX.


Puzzle Program

The Puzzle program is a recursive puzzle solving program originated by Forest Baskett that
has been used to benchmark many mainframe and minicomputers. Selinger presents listings of
the Puzzle program and sources of Puzzle benchmark data for many computers. Patterson and
Séquin identify benchmark results for a 400 ns RISC I. Table 2 lists the results of Puzzle
benchmarks for BRISCs A, B and C relative to the Selinger and Patterson numbers. Table 2 only
lists the better numbers where conflicting data is available for the same machine. BRISCs A and
B outperform all available Puzzle times; even the worst case BRISC C outperformed all
benchmarks except for the best time reported for the Amdahl 470V/8.
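The BRISC entries in the benchmark tables follow from scaling the 400 ns RISC times by the cycle time ratio. The sketch below assumes that scaling and uses the Puzzle times reported for the 400 ns RISC (3.2 seconds) and the VAX 11/780 (3.5 seconds) with C and pointers; the function names are illustrative.

```python
RISC_CYCLE_NS = 400.0

def scaled_time(risc_seconds, brisc_cycle_ns):
    """Scale a 400 ns RISC benchmark time by the cycle time ratio."""
    return risc_seconds * brisc_cycle_ns / RISC_CYCLE_NS

def vax_multiple(vax_seconds, machine_seconds):
    """Performance as a multiple of the VAX: VAX seconds / machine seconds."""
    return vax_seconds / machine_seconds

risc_puzzle, vax_puzzle = 3.2, 3.5  # seconds, C with pointers
for name, cycle in (("BRISC A", 37.0), ("BRISC B", 53.0), ("BRISC C", 120.0)):
    t = scaled_time(risc_puzzle, cycle)
    print(name, round(t, 1), round(vax_multiple(vax_puzzle, t), 1))
```

Under these assumptions the loop yields times of 0.3, 0.4, and 1.0 seconds and VAX multiples of 11.8, 8.3, and 3.6, matching the BRISC rows of the Puzzle results.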




Table 2. Puzzle Benchmark Results

MACHINE            TIME    VAX       LANGUAGE   COMMENTS               CPU COST
                   sec     sec/sec
BRISC A (37 ns)    0.3     11.8      C          pointers               $132K
BRISC B (53 ns)    0.4     8.3       C          pointers               $132K
Amdahl 470V/8      0.7     5.0       C          pointers, reg vars     $1100K
BRISC C (120 ns)   1.0     3.6       C          pointers               $132K
Amdahl 470V/8      1.1     3.2       C          pointers               $1100K
Amdahl 470V/8      1.1     3.2       Pascal     no range checking      $1100K
Amdahl 470V/8      1.2     2.9       C          reg vars               $1100K
IBM 3081           1.4     2.5       Pascal                            $3260K
Amdahl 470V/8      1.6     2.2       C          subscripts             $1100K
S-1 Mark I         2.0     1.8       Pascal     subscripts             $2000K
Dorado             2.0     1.8       Mesa       pointers               $118K
Amdahl 470V/8      2.3     1.5       Pascal                            $1100K
Dorado             3.0     1.2       Mesa       subscripts             $118K
RISC II (400 ns)   3.2     1.1       C          pointers               $10K
VAX 11/780         3.5     1.0       C          pointers               $132K
DEC 2060           4.4     0.8       Pascal     pointers               $502K
DEC 2060           4.6     0.8       C          pointers, hand opt     $502K
VAX 11/780         4.6     0.8       C          subscripts             $132K
RISC II (400 ns)   4.7     0.7       C          subscripts             $10K
DEC 2060           4.7     0.7       C          subscripts, hand opt   $502K
DEC 2060           5.3     0.7       C          pointers               $502K
DEC 2060           5.4     0.6       Pascal     subscripts             $502K
DEC KL10           6.0     0.6       C          pointers               $540K
VAX 11/780         6.1     0.6       Pascal     subscripts             $132K
PDP 11/70          6.4     0.5       C          pointers               $70K
DEC 2060           7.3     0.5       C          subscripts             $502K
IBM 158            7.5     0.5       Pascal     subscripts             $1550K
68000 (8 MHz)      17.0    0.2       C          pointers, nunix        $10K

TAK Benchmark

The TAK benchmark is a heavily recursive function used to measure the efficiency of LISP
procedure calls as well as the efficiency of LISP fixnum and bignum arithmetic. Fixnum
arithmetic refers to integers of bounded length, whereas bignum arithmetic refers to integers of
unbounded length. Figure 6 is a listing of the TAK benchmark. Table 3 presents the results
reported by Ponder compared to the times calculated for BRISC. BRISCs A and B again
outperform all available benchmarks; BRISC C is beaten only by the Dorado.

Ponder estimated RISC I performance for LISP by compiling the LISP code on a VAX and
hand translating the output to RISC I assembly code. Ponder states that a LISP compiler
optimized for RISC could be built with better performance than is shown in Table 3. Recent studies
by Ponder show that with minor enhancements a 400 ns RISC processor could be built that
executed the TAK benchmark in under 0.66 seconds. This translates to times of 0.06, 0.09 and 0.20
seconds for BRISCs A, B, and C (5.6 to 18.0 times faster than a VAX).




(tak 18 12 6)

(defun tak (x y z)
  (cond ((not (lessp y x)) z)
        (t (tak (tak (sub1 x) y z)
                (tak (sub1 y) z x)
                (tak (sub1 z) x y)))))


Figure 6. TAK Benchmark
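For readers unfamiliar with LISP, the recursion in the TAK listing can be restated as follows. This Python transcription is an illustration only and is not the code used for any of the reported timings; it assumes the standard form of the function, in which the third argument is returned once the second is no longer less than the first.

```python
def tak(x, y, z):
    """TAK function: returns z once y is no longer less than x."""
    if not (y < x):
        return z
    return tak(tak(x - 1, y, z),
               tak(y - 1, z, x),
               tak(z - 1, x, y))

print(tak(18, 12, 6))  # prints 7
```

Nearly all of the work is procedure calls and small integer decrements and comparisons, which is why the benchmark isolates call overhead and fixnum arithmetic.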


Table 3. TAK Benchmark Results

MACHINE            TIME    VAX       LANGUAGE          CPU COST
                   sec     sec/sec
BRISC A (37 ns)    0.2     5.9       PSL/Franz LISP    $132K
BRISC B (53 ns)    0.3     4.2       PSL/Franz LISP    $132K
Dorado             0.5     2.2       InterLISP         $118K
BRISC C (120 ns)   0.6     1.8       PSL/Franz LISP    $132K
DEC KL10           0.8     1.4       MacLISP           $540K
VAX 11/780         1.1     1.0       Franz LISP        $132K
RISC II (400 ns)   2.0     0.6       PSL/Franz LISP    $10K
68000 (8 MHz)      2.0     0.9       PSL SYSLISP       $10K
Dolphin            5.7     0.2       InterLISP         $40K

C  Compiler 

Miros ported the Portable C Compiler by Steve Johnson to RISC I. After porting the
compiler, Miros compared the performance of the VAX Portable C Compiler running on a VAX to
the VAX Portable C Compiler running on a 400 ns RISC I. His results are presented in Table 4.
BRISCs A, B, and C are all significantly faster than a VAX.


Table 4. VAX Portable C Compiler Benchmark Results

MACHINE            LD.C           SORT.C                    PUZZLE.C
                   COMPILE TIME   COMPILE TIME              COMPILE TIME
                   sec            sec       VAX sec/sec     sec
BRISC A (37 ns)    1.6            1.0       17.7            0.3
BRISC B (53 ns)    2.2            1.4       12.4            0.4
BRISC C (120 ns)   5.1            3.2       5.5             0.9
RISC II (400 ns)   16.9           10.6      1.6             2.9
VAX 11/780         27.9           17.4      1.0             5.2



Benchmark Summary

Figure 7 is a summary of some of the results from the above benchmarks. The best times
are shown for four mainframes, two minicomputers, and two microcomputers. Performance is
shown as a multiple of VAX performance. BRISCs A and B outperform all the listed computers,
and BRISC C is competitive. Using BRISC B as the best estimate of BRISC performance, BRISC
is 8 to 13 times faster than a VAX for C programs, and 4 times faster than a VAX for LISP
programs.

4.  Cost 

Much as performance is determined by the combination of many factors, so is cost
determined by many factors. Selinger defines computer workstation cost as the sum of the
manufacturing and labor costs of parts, boards, cables, logic cages, backplanes, cabinets, power
distribution back panel, front panel, assembly, and test. He then defines price as the sum of
manufacturing cost, development cost, sales cost, service cost, and profit or loss. It is beyond the scope of
this paper to estimate the values of all these items to calculate the price of a computer built
around the BRISC CPU. Instead, the cost of a BRISC computer will be estimated by comparing
BRISC to an existing computer to calculate the relative price of BRISC. The Dorado computer
by Xerox was selected as the existing computer for comparison because it is also an ECL
computer, and is about the same size as a system built around BRISC.

The Dorado is a high-performance personal computer consisting of 3200 medium scale
integrated components (not including memory), most of which are ECL 10K. The Dorado
contains everything a BRISC computer would need including up to 8 Megabytes of main memory, a
high-performance cache with a 30 ns cycle time, peripheral interfaces, and a large 2050 watt
power supply producing sufficient power and ECL voltages (-5V at 250A and -2V at 75A).

The cost for a BRISC computer is calculated by comparing the cost of the BRISC CPU to
the Dorado CPU and keeping all other costs the same. The difference in cost of the two CPUs is
found by first determining the difference in the number of chips. To compare the number of chips
in the Dorado with the number of chips in BRISC, a lower bound is determined for the number of
chips in the Dorado that would be replaced by the BRISC CPU.

The Dorado consists of up to 24 boards with up to 288 chips per board. Five of the 24
boards constitute the Dorado CPU and are listed in Table 5. Table 5 also lists a lower bound for
the number of chips on each of the five boards in the CPU.

About half of the Instruction Fetch Unit (IFU) is used to control the memory; the other half
is devoted to the CPU. To obtain a conservative estimate, the IFU will not be considered as part
of the CPU.


Figure 7. Performance Summary


Table 5.

BOARD                       % POPULATED     # CHIPS (288 Max)
Instruction fetch unit      80%             230
Processor high byte         90-95%          260-275
Processor low byte          90-95%          260-275
Control section             90-95%          260-275
Microinstruction memory     90-95%          260-275

The Dorado CPU performs task switching at the microcode level. Microcode tasks are
called microtasks. Some microtasks are used as device controllers; a DMA controller would
therefore be required if the Dorado CPU was removed. A DMA controller would be no more
complex than the Control Section of the Dorado, so the Control Section will not be counted as
part of the Dorado CPU.

Ignoring the IFU and Control Section leaves three boards in the Dorado CPU: the Processor
High Byte, Processor Low Byte, and the Microinstruction Memory. These three boards contain
about 780 chips. The 332 100K chips of the BRISC CPU would occupy the same board space
required by about 664 10K chips because of the 24-pin packages used by the 100K parts versus
the 16-pin packages used by the 10K parts. The BRISC CPU therefore requires 15% less board
area than the Dorado CPU. Insertion cost is also 15% less for BRISC; 24-pin packages cost twice
as much to insert as 16-pin packages, but BRISC has 43% the number of chips of the Dorado
CPU.
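
The arithmetic behind the board area and insertion cost comparison can be checked with a short
sketch. The chip counts are the ones quoted above; the factor of two for a 24-pin package
relative to a 16-pin package is an assumption taken from the comparison in the text, and the
function and variable names are illustrative only.

```python
# Figures quoted in the text.
DORADO_10K_CHIPS = 780   # lower bound for the three Dorado CPU boards
BRISC_100K_CHIPS = 332   # BRISC CPU parts count

# Assumption: one 24-pin 100K package takes the board space (and the
# insertion cost) of two 16-pin 10K packages.
PACKAGE_RATIO = 2.0


def relative_to_dorado(brisc_chips, dorado_chips, ratio=PACKAGE_RATIO):
    """BRISC board area (or insertion cost) as a fraction of the Dorado CPU's."""
    return brisc_chips * ratio / dorado_chips


equivalent_10k = BRISC_100K_CHIPS * PACKAGE_RATIO
chip_fraction = BRISC_100K_CHIPS / DORADO_10K_CHIPS
area_fraction = relative_to_dorado(BRISC_100K_CHIPS, DORADO_10K_CHIPS)

print(f"equivalent 10K chips: {equivalent_10k:.0f}")
print(f"BRISC chips as a share of Dorado chips: {chip_fraction:.0%}")
print(f"board area saving: {1 - area_fraction:.0%}")
```

The same ratio applies to insertion cost, since the per-package cost doubles while the chip
count falls to well under half.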

In order to calculate a conservative estimate for the cost of a BRISC CPU, the Dorado and
BRISC CPUs are assumed to have the same PC board and IC insertion costs. PC board and
insertion costs being equal, the difference in cost of the two CPUs will therefore be largely
determined by the difference in parts cost. Parts cost of the Dorado CPU is not known (except by
Xerox). The least expensive 10K part sells for 61 cents in quantities of 100 to 1000, so a lower
bound on parts cost for the Dorado CPU is $475.80. The BRISC parts list (Appendix D) shows
the parts cost of BRISC is $3330.32 in parts quantities of 100 to 1000. The resulting parts cost
difference between the BRISC CPU and the Dorado CPU is at most $2854.52. Selinger calculated
the markup for the VAX-11/780 to be 5.2 times, which he showed to be consistent with the
markups used by other manufacturers. Using a 5.2 times markup for the parts differential between
BRISC and Dorado gives a selling price difference of $14,843.50. Dorados sell for $129,500 with 2
Megabytes (MB) of main memory and an 80 MB disk. Discounting for the cost of an 80 MB disk
($11,900) gives a selling price of $117,600 for the Dorado without peripherals. A BRISC
computer could therefore be sold for less than $14,843.50 over the $117,600 selling price of a Dorado,
or about $132,444. This is nearly identical to the price of a VAX-11/780 without peripherals,
which Selinger calculates to be $132,400 with 1 MB of main memory (the Dorado comes with 2
MB).
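
The price estimate is a chain of simple steps, reproduced here as a sketch so the intermediate
figures can be checked. All inputs are the figures quoted in the text; the cheapest 10K part is
taken as 61 cents, the value consistent with the $475.80 lower bound for 780 chips. The variable
names are illustrative only.

```python
# Inputs quoted in the text.
DORADO_CPU_CHIPS = 780        # lower bound on Dorado CPU chip count
CHEAPEST_10K_PART = 0.61      # dollars, quantities of 100 to 1000
BRISC_PARTS_COST = 3330.32    # dollars, from the BRISC parts list
MARKUP = 5.2                  # Selinger's markup for the VAX-11/780
DORADO_PRICE = 129_500        # dollars, with 2 MB memory and an 80 MB disk
DISK_PRICE = 11_900           # dollars, 80 MB disk

# Lower bound on Dorado CPU parts cost: every chip priced as the cheapest part.
dorado_parts_lower_bound = DORADO_CPU_CHIPS * CHEAPEST_10K_PART

# Upper bound on the parts cost difference, then marked up to a selling price.
parts_difference = BRISC_PARTS_COST - dorado_parts_lower_bound
price_difference = parts_difference * MARKUP

# Dorado price without peripherals, and the resulting BRISC estimate.
dorado_base_price = DORADO_PRICE - DISK_PRICE
brisc_price = dorado_base_price + price_difference

print(f"Dorado CPU parts lower bound: ${dorado_parts_lower_bound:,.2f}")
print(f"parts cost difference:        ${parts_difference:,.2f}")
print(f"selling price difference:     ${price_difference:,.2f}")
print(f"BRISC price estimate:         ${brisc_price:,.0f}")
```

Because the Dorado parts cost is a lower bound, the difference and the final price are upper
bounds, which is what makes the estimate conservative.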

5. Closing Remarks

This section presents retrospectives by the BRISC designers on the BRISC project.

5.1. Comments on SCALD

The BRISC project would not be at the point it is today without the SCALD system.
SCALD has proven to be an invaluable tool for computer design. Traditionally, computer design
has been a rigid process because of the difficulty of making changes to hardware once it is built.
SCALD provides a vehicle to do exploratory hardware design, much as a good programming
environment allows exploratory programming. Beau Sheil defines exploratory programming as an
approach to programming combining system design with implementation. In an exploratory
programming environment, the programmer experiments with changes to the design and
immediately sees their effect on the operation of the system. Exploratory hardware design combines
design and implementation in the sense that the hardware designer can make changes to the
hardware design and immediately see the effect on the operation of the hardware system. Effects
are seen by using timing verification, simulation and post processing rather than by building and
testing physical hardware. In this way many possible solutions to the same problem may be tried
and evaluated, and the best solution selected. The best solution is picked on the basis of logical
correctness, speed and parts cost.

SCALD provides the equivalent of what Sheil calls programming power tools in the form of
five application programs: a graphics editor, compiler, timing verifier, simulator, and post
processor. These tools amplify the power of the hardware designer so that he can produce a better
design in less time.

As with most software tools, the tools provided by the Valid SCALD system have several
deficiencies that detract from the usefulness of the system, the most notable being the speed of
the system. This section describes the speed of the Valid SCALD system and comments on each
of the Valid SCALD tools from the point of view of the BRISC designers.

5.1.1. SCALD Speed

While SCALD comes a long way in making exploratory hardware design as simple as
exploratory programming, the single biggest detractor is the speed of the SCALD system.
Exploratory hardware design implies interactive use of the SCALD system, which means that fast
response is required for most commands (i.e. less than one second), and response time may
occasionally go up to 30 seconds. Studies have shown that user productivity goes down when response
times go over three seconds, probably because of the disruption of user thought processes.
Table 6 lists a few of the response times encountered during the BRISC design. While Valid has
significantly improved the speed of the post processor (it once took over six hours for BRISC
with 606 parts, and now only takes 12 minutes with 332 parts), the response times need to be
much better to allow true interactive design.


Table 6. Valid SCALD Speed

FUNCTION              APPLICATION PROGRAM      MODULE            TIME (min:sec)
Timing Verification   Compiler                 PCs & Address      3:58     3:23
                                               BRISC             23:19    22:21
                      Timing Verifier          PCs & Address      2:31     2:30
                                               BRISC††           13:32    12:19
Simulation            Compiler                 PCs & Address      4:13     3:37
                                               BRISC             32:26    26:56
                      Simulator (start-up)     PCs & Address      7:10
                                               BRISC             17:59
                      Simulator (per cycle)    PCs & Address       :04
                                               BRISC               :36
Post Processing       Compiler                 PCs & Address      1:59     1:28
                                               BRISC             10:19     9:34
                      Post Processor           PCs & Address      1:40     1:38
                                               BRISC             12:23    12:19

†† Does not include PCs & Address, Cache, Special Registers, or Result Latch.


In fairness to Valid it should be mentioned that the Valid SCALD system used for BRISC
runs on a 68000. Valid also sells SCALD systems for VAXs (with the VMS operating system) and
IBM 370s. Table 2 may be used to compare the performance and cost of the 68000, VAX and
IBM 370 (Models 158 and 3081).


5.1.2. Graphics Editor

The graphics editor proved to be a useful tool for entering drawings and changing drawings
after they were entered. The graphics editor is very fast for most drawings and encourages
interactive design. The graphics editor's interactive nature made it much more useful than other
drawing tools previously available at Berkeley.

5.1.3. Compiler

The compiler converts drawings from the format created by the graphics editor to a format
usable by the timing verifier, simulator and post processor. As such, the compiler is nothing more
than an intermediary between the graphics editor and the other applications as opposed to a
design tool.

The compiler has proven useful for finding assertion violations. Assertion violations are
signals that are asserted high which are inadvertently tied to signals that are asserted low. While
only a few such errors have been found by the compiler in BRISC, the errors were detected faster
and corrected sooner than if they had been found with the simulator (see Table 6).
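
A check of this kind can be sketched in a few lines. This is only an illustration of the idea:
the net representation, the "H"/"L" polarity tags, and the signal and pin names are assumptions
made for the example, not the data structures or naming conventions of the SCALD compiler.

```python
def find_assertion_violations(nets):
    """Return the names of nets that tie together pins of both polarities.

    nets maps a net name to a list of (pin, polarity) pairs, where polarity
    is "H" (asserted high) or "L" (asserted low). This representation is
    hypothetical and chosen only for illustration.
    """
    return sorted(
        name
        for name, pins in nets.items()
        if len({polarity for _pin, polarity in pins}) > 1
    )


# Hypothetical example nets.
example_nets = {
    "LOAD_PC": [("U1.out", "H"), ("U7.in", "H")],   # consistent
    "RESET": [("U2.out", "L"), ("U9.in", "H")],     # low tied to high
}
violations = find_assertion_violations(example_nets)
print(violations)
```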

The  compiler  is  not  designed  for  interactive  use.  A  command  file  must  be  edited  every  time 
a  different  type  of  output  is  required  (i.e.  for  the  timing  verifier,  simulator  or  post  processor),  and 
every  time  a  different  drawing  is  compiled.  The  command  file  supports  batch  versions  of  SCALD, 
and  is  not  appropriate  for  an  interactive  environment. 

5.1.4. Timing Verifier

The timing verifier uses minimum and maximum timing parameters for all parts in a circuit
to calculate a value history of all signals during a single clock cycle. The timing verifier has two
uses: one is to find timing violations such as changing signals violating set-up times; the other is
to calculate the cycle time of a circuit.

Timing Violations

The timing verifier is useful when it finds valid timing violations, but is sometimes difficult
to use and is too conservative. To get realistic outputs, timing behavior of undriven inputs to a
circuit must sometimes be specified, otherwise all signals may stay stable and no verification will
occur. Sometimes, the timing verifier is so conservative that it finds errors that could never occur
in a real circuit. For example, the timing verifier once identified the input to a flip-flop to be
changing before the clock, even though the input was driven by the negative output of the
flip-flop (that cannot possibly change before the clock).

Some timing violations cause the timing verifier to go into an endless loop, making them
exceedingly difficult to find. A few timing violations have been detected in BRISC by the timing
verifier, mostly with the endless loop method.

Cycle Time Calculation

The timing verifier does not automatically identify the worst case path through a circuit, as
is done by Crystal. Instead, the designer must examine the value history of all signals in a
circuit and find the signal that is stable last. The worst case path is important for two reasons.
First, the delay through the worst case path must be known to find the minimum cycle time of a
circuit. Second, the designer should concentrate on optimizing the worst case path to speed up
the circuit, but he needs to know the worst case path before he can fix it. The minimum cycle
time may also be found by running the timing verifier with successively shorter and shorter clock
cycles until timing errors are found; that still does not identify the worst case path and is time
consuming.
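
The two manual procedures described above can be sketched as follows. This is a hypothetical
illustration, not part of SCALD: `stable_times` stands in for the value history read off the
timing verifier output, `passes` stands in for one timing verifier run at a given cycle time, and
the signal names and times are made up. The search uses bisection rather than stepping the cycle
time down one value at a time, which reduces the number of verifier runs; it assumes that a
design which passes at one cycle time also passes at any longer one.

```python
def last_stable_signal(stable_times):
    """Return (name, time) of the signal that becomes stable last in the cycle."""
    name = max(stable_times, key=stable_times.get)
    return name, stable_times[name]


def minimum_cycle_time(passes, low_ns, high_ns):
    """Smallest integer cycle time in [low_ns, high_ns] for which passes(t) is true.

    passes(t) models one timing verifier run with a cycle time of t ns and
    returns True when no timing errors are reported.
    """
    if not passes(high_ns):
        raise ValueError("design fails even at the longest cycle time")
    while low_ns < high_ns:
        mid = (low_ns + high_ns) // 2
        if passes(mid):
            high_ns = mid        # mid works; try shorter cycles
        else:
            low_ns = mid + 1     # mid fails; the minimum is longer
    return high_ns


# Hypothetical value history: time (ns) at which each signal becomes stable.
stable_times = {"ALU_OUT": 31, "SHIFT_OUT": 12, "REG_ADDR": 44}
worst = last_stable_signal(stable_times)

# Stand-in for the verifier: the cycle must cover the last-stable signal.
cycle = minimum_cycle_time(lambda t: t >= worst[1], 1, 100)
print(worst, cycle)
```

As the text notes, the search alone yields only the cycle time; identifying the worst case path
still requires examining which signal settles last.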

Since large CPUs tend to spend about half of their cycle times in wire delays, it is essential
that wire delays be included in timing verifier calculations. Wire delays for ECL should include
both delays due to the length of the wires, and delays induced by loads along the wires. Because
of the speed of ECL, most wires act like transmission lines, and the wire delay calculator should
treat them as such. As was mentioned in Section 3.1.1.1, a single pair of minimum and maximum
wire delays may be specified to the verifier, but that is too crude an approximation.
Alternatively, the timing verifier does have the capability to accept calculated wire delays from a
physical design system, but the Valid SCALD system does not include a physical design system. Even
though a SCALD design may be transformed into the input format of another vendor's design
system, this precludes exploratory hardware design because it is no longer interactive; the design
cycle is measured in days instead of minutes when two design systems are used.


5.1.5. Simulator

The simulator does logic simulation of a circuit to verify logical correctness of a design.
The simulator is potentially the most useful of the SCALD tools, but it also takes the longest to
get any results and to correct errors that are found. The simulator has been used to simulate the
operation of the PCs AND ADDRESS, the ALU PATH, and the SHIFTER PATH. These
modules were picked for simulation because they each do a well defined logical function. The
entire BRISC design has been loaded into the simulator, but simulation has only been partially
completed because of lack of time.

Once the simulator is started, it is difficult to use because every signal of interest must be
identified for display, and most inputs initialized. A batch file may be used to identify and
initialize signals, but that seems to defeat the interactive capability of the simulator, and the batch
file is just as difficult to generate in the first place. The simulator is generally used to trace the
values of signals through progressive stages of logic. To do this more and more signals must be
displayed and the circuit stepped a single cycle at a time to trace the behavior of signals. When
an error is found, the simulator must be exited, the drawing edited to fix the error, the drawing
recompiled, and the whole tedious simulation process restarted to get to the same point. A batch
file may be used to return to the same point in the simulator, but if the error was fixed many of
the signals being displayed are probably no longer needed because they were only displayed to
find the error.

To provide exploratory hardware design, the simulator should interact with the graphics
editor to allow graphical identification of signals of interest. Incremental recompilation should be
provided to allow rapid update. Identification and correction of errors should be accomplished
without leaving the simulator. These capabilities are identical to the debugging capabilities
provided by good exploratory programming environments such as Smalltalk-80.

Even though it is hard to use, the simulator has found about a dozen errors in BRISC that
would have prevented correct operation of the BRISC CPU.

5.1.6. Post Processor

The post processor has been used to find loading errors, undriven signals, and to count the
number of ICs (physical packaging) in the BRISC CPU.

Loading Errors

The post processor finds loading errors by calculating the fan-out of every logical part and
flagging those parts that exceed fan-out limits. This function of the post processor found about
20 loading errors in the BRISC design; all have since been corrected.
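
The fan-out check can be illustrated with a minimal sketch. The connection list format, the part
names, and the single fan-out limit are assumptions for the example; a real post processor would
take limits from the part library and would likely weight loads by type.

```python
from collections import Counter


def find_loading_errors(connections, fan_out_limit):
    """Return {driver: fan_out} for every driver that exceeds the limit.

    connections is an iterable of (driver, load) pairs, one per driven
    input. The format is hypothetical, chosen only for illustration.
    """
    fan_out = Counter(driver for driver, _load in connections)
    return {d: n for d, n in fan_out.items() if n > fan_out_limit}


# Hypothetical connections: G1 drives four inputs, G2 drives one.
example_connections = [
    ("G1", "U1.a"), ("G1", "U2.a"), ("G1", "U3.a"), ("G1", "U4.a"),
    ("G2", "U5.a"),
]
loading_errors = find_loading_errors(example_connections, fan_out_limit=3)
print(loading_errors)
```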

Undriven Signals

The post processor identifies all nets in a design and flags nets that are not driven. This is a
valuable function because on a large design such as BRISC it is easy to assume that a signal is
going to be generated and then forget to generate it or accidentally use a different name. Some
nets are intentionally not driven, such as the reset input that is assumed to be externally
generated. The post processor found over 15 signals that were inadvertently left undriven in the
original BRISC design, which have since been corrected.
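
A sketch of the undriven net check follows. The net table format and the net names are
hypothetical; the `external_inputs` argument models nets such as the reset input that are
intentionally left undriven.

```python
def find_undriven_nets(nets, external_inputs=frozenset()):
    """Return the sorted names of nets with no driver.

    nets maps a net name to the list of parts driving it (possibly empty).
    Nets listed in external_inputs are assumed to be generated outside the
    design and are not flagged. The format is hypothetical.
    """
    return sorted(
        name
        for name, drivers in nets.items()
        if not drivers and name not in external_inputs
    )


# Hypothetical nets: CARRY_IN was assumed to exist but never generated.
example_nets = {
    "ALU_OUT": ["ALU"],
    "CARRY_IN": [],
    "RESET": [],
}
undriven = find_undriven_nets(example_nets, external_inputs={"RESET"})
print(undriven)
```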

Physical Packaging

The physical packager assigns logical parts to physical parts and gives a parts count of each
type of part. This is useful for finding the cost of the design, and to verify that the design will fit
in the allotted space (e.g. a single circuit board). Exploratory design with the graphics editor and
physical packager reduced the BRISC parts count from 606 to 342 for a savings of 44% as
described in Section 1.2.
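
The parts count calculation can be sketched as below. The part types and the number of logical
sections per package are made-up values for illustration, and the packing rule (fill each
package of a type before starting another) is an assumption; only the 606 and 342 parts counts
come from the text.

```python
import math


def package_counts(logical_parts, sections_per_package):
    """Number of physical packages needed for each part type.

    logical_parts maps a part type to the number of logical parts used;
    sections_per_package maps a part type to how many logical parts fit
    in one package. Both tables are hypothetical.
    """
    return {
        part_type: math.ceil(count / sections_per_package[part_type])
        for part_type, count in logical_parts.items()
    }


def savings(before, after):
    """Fractional reduction in parts count."""
    return (before - after) / before


# Hypothetical design: 11 gates at 5 per package, 6 latches at 3 per package.
counts = package_counts({"gate": 11, "latch": 6}, {"gate": 5, "latch": 3})
print(counts, sum(counts.values()))

# Parts counts quoted in the text.
print(f"{savings(606, 342):.0%}")
```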

5.2. Lessons Learned

The  BRISC  project  has  been  instructive  to  the  BRISC  designers  in  two  areas:  computer 
architecture  and  Computer  Aided  Design. 


Smalltalk-80 is a registered trademark of Xerox Corp.




Computer Architecture

BRISC has shown us that the RISC concepts for computer design can be successfully
applied to the design of high performance computers for integer high level language programs.
We have found that it pays to concentrate effort on the data path to shorten the cycle time. We
have also found that the smaller control implied by RISC helps to greatly reduce the parts count,
decrease the cycle time and decrease the design time. The smaller parts count helps to keep the
cycle time down by minimizing both logic delays and wire delays. Furthermore, the simple
control used by BRISC prevents pipeline flushes so that cycles are not wasted.

Computer Aided Design

We have learned that designing computers without computer aided design is like writing
programs without compilers. Computer aided design with SCALD allows design iterations for
correcting design errors and for tuning the design without the expense and inconvenience of
building hardware. The SCALD system by Valid has the functions needed for such exploratory design;
it just needs more speed, fewer bugs and a physical design system. Even with these few
weaknesses, there is no way we could have done this project without the Valid SCALD system.

The slowness of the SCALD system may be partially due to the speed of the 68000 CPU
used  by  Valid.  SCALD  is  a  good  example  of  an  integer  high  level  language  application  that  could 
be  enhanced  by  a  BRISC  processor. 

5.3. Future Work

The two designers of BRISC are both interested in continuing the BRISC project. Work
needs to be done in all aspects of the project described in this paper: hardware design, timing
verification, simulation, and post processing. Once these are completed, BRISC will be
ready for fabrication.

Hardware  Design 

The hardware design for BRISC is complete except for a couple of minor portions of the
CPU such as a comparator to detect register-as-memory accesses to the register file. The
interface to the support processor needs to be designed, and the 'microcode' must be completed for all
instructions.

Even though the design is nearly complete, we would like to make some changes to the
design to speed it up and reduce parts count. One area that could benefit from further redesign is
control. While the current implementation of control is logically correct, extra bits have been
included for compatibility with old versions of the microcode and to aid in debugging. Before
fabrication, control should be reorganized to remove unneeded bits. The resulting design should
be smaller than the current design (by at least five chips).

Timing Verification

We  feel  confident  that  the  worst  case  path  has  been  timing  verified  and  that  we  have  a 
worst  case  estimate  for  the  speed  of  BRISC.  More  timing  verification  is  needed  to  find  timing 
errors.  Furthermore,  accurate  wire  delays  should  be  used  to  get  a  more  accurate  estimate  of  the 
BRISC  cycle  time. 

Simulation 

Simulation  has  been  completed  on  key  portions  of  BRISC,  but  needs  to  be  performed  on  the 
entire  design.  Our  goal  is  to  use  SCALD  to  deposit  programs  into  the  simulated  memory  and 
simulate  BRISC  executing  those  programs.  Such  a  simulation  would  convince  us  that  BRISC  is 
ready  for  fabrication. 

Post Processing

Further  post  processing  will  be  done  to  reduce  the  parts  count  and  to  prepare  the  BRISC 
design  for  fabrication. 

Fabrication 

One design goal for BRISC has been to limit the parts count to fit the BRISC CPU onto a
single S1 Mark IIA wire-wrap board. The S1 wire-wrap board is about 24 by 24 inches and has
space for 325 100K ECL ICs with associated termination resistors and bypass capacitors. The
current 332 chip design of BRISC can be modified to fit on the S1 board with some minor
redesign.

BRISC could be made faster by using a denser board that leads to shorter wire delays. A
new board being developed at Livermore for the S1 Mark III is such a board. The Mark III board
is 18 by 16 inches and has space for 687 100K ECL ICs (on custom carriers). The extra space
could be used for the cache. A disadvantage of the Mark III board is that it must be water
cooled, so it would only be appropriate if BRISC was embedded in a larger system.

Another fabrication technique would be to redesign BRISC using 100K gate arrays. The
resulting design would require fewer chips and would be faster because of shorter wire delays, but
would be more difficult to debug and change than discrete logic. A discrete logic version of the
design could be built as a prototype prior to the gate array version to debug the design.




6. Conclusion

BRISC has shown that a RISC architecture developed for a single chip processor can be
successfully applied to a discrete logic CPU. The resulting processor can be developed for the cost of
a large minicomputer (about $132K) and yet outperforms mainframes costing millions of dollars
on integer high level language programs. In addition, the resulting processor can be designed
quickly because of the simplified architecture and because of the power of computer aided design
tools such as SCALD. As a result, the RISC I architecture can serve as the basis for a family of
processors, starting with medium performance single chip processors such as RISC II and
progressing up to high performance discrete logic processors such as BRISC. Our experience leads us to
believe that the RISC style of architecture will flourish with advances in density and speed of new
implementation technologies.


7.  Acknowledgements 

First, I must thank my partner Jeff Deutsch. RISC CPUs tend to make good two person
projects, and Jeff contributed much needed talents to our team. He designed half of BRISC and
wrote half of our report for the SCALD class during Winter 1983. Parts of our SCALD report
have been incorporated into this paper, including some written by Jeff. Many of Jeff's ideas
appear throughout this paper, including the multiplication and floating point enhancements to
BRISC.

BRISC would not have been possible without my advisor, Dave Patterson. He helped to
start the RISC project at Berkeley and provided guidance and direction for both the BRISC
project and this report. I would also like to thank my second faculty reader, John Ousterhout, who
agreed to read this report on short notice even though he had other commitments at the time.

I would like to thank all those involved with the RISC project for making BRISC possible.
I would especially like to thank those involved with the RISC/E design study: Lloyd Dickman,
Scott Baden, Jim Beck, Paul Hansen and Michael Shiloh. Special thanks go to the designers of
RISC/E, Jim Beck and Helen Davis. Both Jim and Helen were very helpful in explaining the
RISC/E design, and Helen was a great help throughout the project by patiently answering all of
my questions on RISC and RISC/E. Many ideas throughout this paper are due to Helen's
suggestions, including deletion of the register file, redesign of register address calculation, and redesign
of the bussing structure by sharing multiplexors.

I would also like to thank Glen Miranker for explaining the SCALD system, and Valid Logic
Systems for donating their equipment and time to the EECS department at Berkeley.

I am indebted to all those who proofread this paper (in alphabetical order): my parents (Art
and Maria Blomseth), my uncle (Bob Blomseth), Helen Davis, Pete Foley, John Ousterhout, and
Dave Patterson. They all provided excellent comments; any problems left in the paper are my
own fault.

Most of all, I would like to thank my wife, Ginny, and son, Kevin, for putting up with me
going off to school and giving me emotional support along the way.




References

1. Stone, Harold S., Tien Chi Chen, Michael J. Flynn, Samuel H. Fuller, William G.,
Herschel H. Loomis, Jr., William M. McKeeman, Kay B. Magleby, Richard E. Matick, and
Thomas M. Whitney, Introduction to Computer Architecture, Science Research Associates,
Inc.

2. Bell, C. Gordon, J. Craig Mudge, and John E. McNamara, Computer Engineering - A DEC
View of Hardware Systems Design, Digital Press, Bedford, Massachusetts, 1978.

3. Clark, Douglas W., Cache Performance in the VAX-11/780, Systems Architecture Group,
Digital Equipment Corporation, Tewksbury, MA 01876, March 1982.

4. Davidson, Howard L., Personal Communication, July, 1983. Lawrence Livermore National
Laboratory

5. Dickman, Lloyd, Scott Baden, Jim Beck, Paul Hansen, and Michael Shiloh, RISC/E:
Extended Performance Processor Design Study, Computer Science Division, Department of
E.E.C.S, University of California, Berkeley, CA, 1982.

6. Farmwald, Michael, Personal Communication, July, 1983. Lawrence Livermore National
Laboratory

7. Goldberg, Adele and David Robson, Smalltalk-80: The Language and Its Implementation,
Addison-Wesley Publishing Company, Menlo Park, CA, 1983.

8. Johnson, S.C., "A Portable Compiler: Theory and Practice," in Proc. Fifth Annual ACM
Symposium of Programming Languages, pp. 97-104, Tucson, Arizona, January 1978.

9. Lampson, Butler W. and Kenneth A. Pier, "A Processor for a High-Performance
Computer," in Proceedings of the 7th Annual Symposium on Computer Architecture, IEEE,
May 6-8, 1980.

10. Mayo, Robert N., John K. Ousterhout, and Walter S. Scott, 1983
Works by the Original Artists, Computer Science Division, Department of E.E.C.S,
University of California, Berkeley, CA, April 12, 1980.

11. McWilliams, Thomas M. and Lawrence C. Widdoes, Jr., SCALD: Structured Computer-
Aided Logic Design, Computer Science Department, Stanford University and Lawrence
Livermore Laboratory, University of California, 1982.

12. Miros, James C., A C Compiler for RISC I, Computer Science Division, Department of
E.E.C.S, University of California, Berkeley, CA, 1982.

13. Ousterhout, John, Using Crystal for Timing Analysis, Computer Science Division,
Department of E.E.C.S, University of California, Berkeley, CA, 1983.

14. Patterson, David A. and Carlo H. Séquin, "A VLSI RISC," Computer, vol. 15, no. 9, pp.
8-21, September 1982.

15. Picha, Marianne, DAPL: A General Purpose Microprogramming Assembler for U^,
Computer Science Division, Department of E.E.C.S, University of California, Berkeley, CA,
October 2, 1980.

16. Pier, Kenneth A., "A Retrospective on the Dorado, A High-Performance Personal
Computer," in Proceedings of the 10th Annual Symposium on Computer Architecture, IEEE,
1983.

17. Pier, Kenneth A., Personal Communication, July, 1983. Xerox Corporation

18. Ponder, Carl, '...but will RISC run LISP??' (a feasibility study), Computer Science Division,
Department of E.E.C.S, University of California, Berkeley, CA, April 8, 1983.

19. Selinger, Robert David, The Design and Evaluation of Single and Multiple Processor Office
Workstations, PhD Dissertation, Computer Science Division, Department of E.E.C.S,
University of California, Berkeley, CA, June 9, 1983.

20. Sheil, Beau, "Environments for Exploratory Programming," Datamation, February 1983.

21. Smith, Alan Jay, "Cache Memories," Computing Surveys, vol. 14, no. 3, pp. 473-530,
September 1982.

22. Sun Microsystems, Inc., Sun Workstation Architecture, Sun Microsystems, Inc., Mountain
View, CA, Feb 1983.

23. Sun Microsystems, Inc., U.S. and Canadian Price List, Sun Microsystems, Inc., Mountain
View, CA, April 15, 1983.

24. Thadhani, A. J., "Interactive User Productivity," IBM Systems Journal, vol. 20, no. 4, 1981.

25. UC Berkeley, RISC I Principles of Operation, Computer Science Division, Department of
E.E.C.S, University of California, Berkeley, CA, Oct 1981. draft




Appendix  A 

BRISC Instruction Set


Instr.                                                           Comments

ADD                     D <- S1 + S2                             integer add
ADDC                    D <- S1 + S2 + carry                     add with carry
SUB                     D <- S1 - S2                             integer subtract
SUBC                    D <- S1 - S2 - carry                     subtract with carry
RSUB                    D <- S2 - S1                             reverse integer subtract
RSUBC                   D <- S2 - S1 - carry                     subtract with carry
AND                     D <- S1 & S2                             logical AND
OR                      D <- S1 | S2                             logical OR
XOR                     D <- S1 xor S2                           logical EXCLUSIVE OR
SLL                     D <- S1 shifted by S2                    shift left logical
SRL                     D <- S1 shifted by S2                    shift right logical
SRA                     D <- S1 shifted by S2                    shift right arithmetic

LOADW      S1(S2),D     D <- M[S1 + S2]                          load word
LOADWR     #L,D         D <- M[pc + #L]                          load word relative
LOADWS     S1(S2),D     D <- M[S1 + S2]                          load word special
LOADHU     S1(S2),D     D <- M[S1 + S2]                          load halfword unsigned
LOADHUR    #L,D         D <- M[pc + #L]                          load halfword unsigned relative
LOADHUS    S1(S2),D     D <- M[S1 + S2]                          load halfword unsigned special
LOADHS     S1(S2),D     D <- M[S1 + S2]                          load halfword signed
LOADHSR    #L,D         D <- M[pc + #L]                          load halfword signed relative
LOADHSS    S1(S2),D     D <- M[S1 + S2]                          load halfword signed special
LOADBU     S1(S2),D     D <- M[S1 + S2]                          load byte unsigned
LOADBUR    #L,D         D <- M[pc + #L]                          load byte unsigned relative
LOADBUS    S1(S2),D     D <- M[S1 + S2]                          load byte unsigned special
LOADBS     S1(S2),D     D <- M[S1 + S2]                          load byte signed
LOADBSR    #L,D         D <- M[pc + #L]                          load byte signed relative
LOADBSS    S1(S2),D     D <- M[S1 + S2]                          load byte signed special
LOADIH     #L,D         D<31:13> <- #L; D<12:0> <- 0             load immediate high
LOADIR     #L,D         D <- M[pc + #L]                          load immediate relative

STOREW     S,D1(D2)     M[D1 + D2] <- S                          store word
STOREWR    S,#L         M[pc + #L] <- S
STOREWS    S,D1(D2)     M[D1 + D2] <- S                          store word special
STOREH     S,D1(D2)     M[D1 + D2] <- S                          store halfword
STOREHR    S,#L         M[pc + #L] <- S
STOREHS    S,D1(D2)     M[D1 + D2] <- S                          store halfword special
STOREB     S,D1(D2)     M[D1 + D2] <- S                          store byte
STOREBR    S,#L         M[pc + #L] <- S
STOREBS    S,D1(D2)     M[D1 + D2] <- S                          store byte special

BRISC Instruction Set (continued)


Instr.     Operands       Operation                                          Comments

JMP        COND,S1(S2)    next pc <- S1 + S2                                 conditional jump
JMPR       COND,#L        pc <- next pc + #L                                 conditional jump relative
CALL       D,S1(S2)       D <- pc; next pc <- S1 + S2, CWP--                 call
CALLR      D,#L           D <- pc; next pc <- pc + #L, CWP--                 call relative
RET        S1(S2)         pc <- S1 + S2, CWP++                               return
TRAP       D              D <- pc; next pc <- trap vector, CWP--             trap
CALLINT    D,#V           D <- last pc; next pc <- #V; disable interrupts    hardware interrupt
RETINT     S              pc <- S; CWP++; enable interrupts                  return from interrupt
GETCPC     D              D <- pc                                            get current pc
GETLPC     D              D <- last pc                                       get last pc
GETCWP     D              D <- CWP                                           get current window pointer
GETSWP     D              D <- SWP                                           get saved window pointer
GETPSW     D              D <- PSW                                           load status word
PUTCWP     S              CWP <- S                                           put new current window pointer
PUTSWP     S              SWP <- S                                           put new saved window pointer
PUTPSW     S              PSW <- S                                           put new status word


(1) S, S1, and S2 are source registers and specify one of R0 through R31.
S2 may also be a 13-bit immediate value specified as #CONSTANT.

(2) D, D1, and D2 are destination registers and specify one of R0 through R31. D2 may also
be a 13-bit immediate value specified as #CONSTANT.

(3) #L is a 19-bit immediate value.

(4) #V is a hardware generated interrupt vector.

(5) COND is a jump condition and specifies one of:

none    jump always (unconditional jump)
ne      jump if not equal
eq      jump if equal
nc      jump if no carry
c       jump if carry
no      jump if no overflow
o       jump if overflow
lt      jump if less than
le      jump if not greater than
gt      jump if greater than
ge      jump if not less than
los     jump if lower or same
hi      jump if higher
p       jump if plus (positive)
m       jump if minus
no      jump never
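
The immediate widths in notes (1) and (3) add up to a full word: a 19-bit #L placed in D<31:13> by LOADIH, plus a 13-bit immediate for the low bits. The sketch below illustrates that arithmetic in Python. The helper names are hypothetical, and merging the low 13 bits with OR as an unsigned value is an assumption; the table does not say whether the 13-bit immediate is sign-extended.

```python
L_BITS = 19             # width of the #L immediate, per note (3)
IMM_BITS = 13           # width of the S2/D2 immediate, per notes (1) and (2)
WORD_MASK = 0xFFFFFFFF  # 32-bit register


def loadih(l_value):
    """Model of LOADIH: D<31:13> <- #L; D<12:0> <- 0."""
    if not 0 <= l_value < (1 << L_BITS):
        raise ValueError("#L must fit in 19 bits")
    return (l_value << IMM_BITS) & WORD_MASK


def split_constant(value):
    """Split a 32-bit constant into (19-bit high part, 13-bit low part)."""
    value &= WORD_MASK
    return value >> IMM_BITS, value & ((1 << IMM_BITS) - 1)


def build_constant(value):
    """Rebuild a 32-bit constant from a LOADIH followed by a 13-bit immediate."""
    high, low = split_constant(value)
    # Assumption: the low part is merged with OR and is not sign-extended.
    return loadih(high) | low
```

For example, `split_constant(0x12345678)` yields the pair `(0x91A2, 0x1678)`, and `build_constant` recombines the two parts into the original value.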


APPENDIX B. COMMERCIAL CACHE SUMMARY


Note 1: Write back.

Note 2: Write through.

Note 3: Hit ratio over 98% measured in user mode, over 96% in supervisor mode.

Note 4: 4-byte data path between CPU and cache.

Note 5: Hit ratio over 99% measured.

Note 6: No write allocate.

Note 7: No reorder.

Note 8: 8-byte data path between CPU and cache.

Note 9: 2-byte data path between cache and memory.

Note 10: 4-byte data path between cache and memory.

Note 11: LRU allocation.

Note 12: Modified LRU allocation.

Note 13: tS%-90% hit ratios measured (with M to 40 users active). 96%-99% hit ratios calculated.

Note 14: Fetch bypass.

Note 15: Prefetch.

Note 16: Prefetch on miss.

Note 17: Virtual address cache.

Note 18: Multiprocessor support using cache coherence.

Note 19: One write buffer.


APPENDIX C. EFFECTIVE CYCLE TIME CALCULATION


                                 EFFECTIVE CYCLE TIME (ns)
PROCESSOR    MEMORY ACCESS       100% cache hit    90% cache hit
             TIME (ns)

BRISC A       400                37                 73
              800                                  113
             1200                                  153

BRISC B       400                46                 81
              800                                  121
             1200                                  161

BRISC C       400                63                 97
              800                                  137
             1200                                  177

Only two values for hit ratio are shown, because hit ratio versus effective cycle time is a linear function.
The function used to calculate the effective cycle time is:

effective cycle time = (hit ratio)(cpu cycle time) + (1 - hit ratio)(memory access time)
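
The effective cycle time calculation can be checked with a few lines of Python. This is a minimal sketch, assuming the 400, 800 and 1200 ns figures are the time charged on a cache miss and that the 100% hit entries (37, 46 and 63 ns) are the CPU cycle times. The 46 ns value for BRISC B is taken from the CPU cycle time quoted in the abstract. The function and variable names are illustrative only.

```python
def effective_cycle_time(hit_ratio, cpu_cycle_ns, memory_ns):
    """Weighted average of the hit and miss cycle times, in ns."""
    return hit_ratio * cpu_cycle_ns + (1.0 - hit_ratio) * memory_ns


# CPU cycle time (ns) for each processor, i.e. the 100% cache hit column.
CPU_CYCLE_NS = {"BRISC A": 37, "BRISC B": 46, "BRISC C": 63}


def table(hit_ratio=0.9, memory_times=(400, 800, 1200)):
    """Return the rounded effective cycle times per processor."""
    return {
        name: [round(effective_cycle_time(hit_ratio, cycle, mem))
               for mem in memory_times]
        for name, cycle in CPU_CYCLE_NS.items()
    }
```

With a 90% hit ratio, `table()` gives 73, 113, 153 for BRISC A, 81, 121, 161 for BRISC B and 97, 137, 177 for BRISC C.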


APPENDIX D. BRISC PARTS LIST


PART      QUANTITY    100-1000 PRICE                              POWER
NUMBER                $ EACH    TOTAL $     $ EACH    TOTAL $     mA EACH    TOTAL mA

100101         6        5.08      30.48       3.61      21.66         38        228
100102        86        5.08     436.88       3.61     310.46         80       6880
100107         3        7.07      21.21       5.02      15.06         96        288
100122        13        5.58      72.54       3.96      51.48         96       1248
100136        16       24.06     384.96      17.07     273.12        283       4528
100141         4       12.15      48.60       8.62      34.48        238        952
100150        47       12.15     571.05       8.62     405.14        159       7473
100151        10       12.90     129.00       9.15      91.50        210       2100
100155        58       15.50     899.00      11.00     638.00        133       7714
100158        16       19.34     309.44      13.73     219.68        205       3280
100171        25       12.65     316.25       8.98     224.50        114       2850
100179         1       15.50      15.50      11.00      11.00        220        220
100181         8       25.54     204.32      18.13     145.04        300       2400
100422        39       32.24    1257.36      22.80     889.20        200       7800

TOTAL        332               $4696.59               $3330.32                49,753

Note: Prices quoted July 1983 by Hamilton-Avnet for Fairchild parts. Power is from the Fairchild F100K ECL
Data Book except for the 100422, which uses the Fujitsu 100422A-7 power figure.
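
The quantity and price totals in the parts list are plain sums of quantity times unit price. The Python sketch below recomputes them as a cross-check. It assumes the per-row values as read from the table (with the fourth row's first unit price taken as 5.58, which agrees with its row total of 72.54), and it works in integer cents to avoid rounding error. The tuple layout and function name are illustrative.

```python
# (quantity, first-column unit price in cents, second-column unit price in cents)
PARTS = [
    (6, 508, 361), (86, 508, 361), (3, 707, 502), (13, 558, 396),
    (16, 2406, 1707), (4, 1215, 862), (47, 1215, 862), (10, 1290, 915),
    (58, 1550, 1100), (16, 1934, 1373), (25, 1265, 898), (1, 1550, 1100),
    (8, 2554, 1813), (39, 3224, 2280),
]


def totals(parts):
    """Return (total quantity, first price total, second price total) in cents."""
    quantity = sum(q for q, _, _ in parts)
    first = sum(q * p for q, p, _ in parts)
    second = sum(q * p for q, _, p in parts)
    return quantity, first, second
```

`totals(PARTS)` returns 332 parts, 469659 cents ($4696.59) and 333032 cents ($3330.32), matching the TOTAL row.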


APPENDIX E. BRISC DRAWINGS


[Schematic drawings. Titles legible in the drawing title blocks:]

Pipeline Control, Rich Blomseth and Jeffrey T. Deutsch
Phase 3 Decode RAM, Jeffrey T. Deutsch and Rich Blomseth
Phase 4 Decode RAM, Rich Blomseth, 3/22/83
Temp. Data Latch, Rich Blomseth, 2/27/83
Register Number Input Latch, Rich Blomseth, 2/24/83
Alupath, Jeffrey T. Deutsch, 2/27/83
32 Bit Reverser, Jeffrey T. Deutsch
Special Registers (SPECREGS)
Result Latch, Jeffrey T. Deutsch, 3/17/83
32 Bit Register, Rich Blomseth and Jeff Deutsch
16 Bit Counter, Rich Blomseth and Jeff Deutsch


APPENDIX F. DAPL MICROCODE LISTING


** RISC/E II Phase 3 Decode RAM microcode.
**
** 4/18/83
**

DESCRIPTION

MICROMEMORY_IS 256 WORDS_BY 52 BITS;
INTERLIST HEX;
NUMBER_BITS HIGHTOLOW;

DEFINE true AS 1.;
DEFINE false AS 0.;

FIELDS  pclbToRbusOE    WIDTH 1 DEFAULT 0.,
        pccbToRbusOE    WIDTH 1 DEFAULT 0.,
        imlShortImmed   WIDTH 1 DEFAULT 0.,
        imlLongImmed    WIDTH 1 DEFAULT 0.,
        rageuRset       WIDTH 1 DEFAULT 0.,
        sregSel         WIDTH 1 DEFAULT 0.,
        caUiOpcode      WIDTH 1 DEFAULT 0.,
        forwardEnable   WIDTH 1 DEFAULT 1.,
        pipeControl2    WIDTH 1 DEFAULT 0.
                        WITH (phase6Write == 0.,
                              noPhase6Write == 1.),
        pipeControl10   WIDTH 2 DEFAULT 0.
                        WITH (normal == 0.,
                              suspend1 == 2.,   (* aligned load & store *)
                              suspend2 == 3.),  (* unaligned load & store *)
        spare2          WIDTH 2 DEFAULT 0.,
        shISel          WIDTH 2 DEFAULT 0.
                        WITH (shIRbus == 0.,
                              shIRf0 == 1.,
                              shIRf1 == 2.,
                              shIIm == 3.),
        shASel          WIDTH 2 DEFAULT 1.
                        WITH (shARbus == 0.,
                              shARf1 == 1.,
                              shAIm == 2.),
        spare3          WIDTH 1 DEFAULT 0.,
        sregIE          WIDTH 1 DEFAULT 0.,
        aluAIE          WIDTH 1 DEFAULT 1.,
        aluBIE          WIDTH 1 DEFAULT 1.,
        shILE           WIDTH 1 DEFAULT 1.,
        shAE            WIDTH 1 DEFAULT 1.,
        thIE            WIDTH 1 DEFAULT 1.,
        shSigned        WIDTH 1 DEFAULT 0.,
        spare4          WIDTHS DEFAULT 0.,

** The following signals are latched by the Phase 4 latch for
** use during phase 4.

        shS             WIDTHS DEFAULT 0.,  (* ? *)
        shRight         WIDTH 1 DEFAULT 0.,
        aluFn           WIDTH 4 DEFAULT 0.
                        WITH (aluBcdAdd == 0.,
                              aluBcdSub == 1.,
                              aluBcdSubI == 2.,
                              aluBcdNegB == 3.,
                              aluAdd == 4.,
                              aluSub == 5.,
                              aluSubI == 6.,
                              aluNegB == 7.,
                              aluXnor == 8.,
                              aluXor == 9.,
                              aluOr == 10.,
                              aluA == 11.,
                              aluNotB == 12.,
                              aluB == 13.,
                              aluAnd == 14.,
                              aluLow == 15.),
        sysCwphFn       WIDTH 3 DEFAULT 7.
                        WITH (sysCwphLoad == 0.,
                              sysCwphDecr == 4.,
                              sysCwphClear == 5.,
                              sysCwphIncr == 6.,
                              sysCwphHold == 7.),
        usrCwphFn       WIDTH 3 DEFAULT 7.
                        WITH (usrCwphLoad == 0.,
                              usrCwphDecr == 4.,
                              usrCwphClear == 5.,
                              usrCwphIncr == 6.,
                              usrCwphHold == 7.),
        cwplFn          WIDTH 2 DEFAULT 2.
                        WITH (cwplIncr == 1.,
                              cwplHold == 2.,
                              cwplDecr == 3.),
        resSel          WIDTH 2 DEFAULT 0.
                        WITH (resultAlu == 0.,
                              resultCbus == 1.,
                              resultRf1 == 2.,
                              resultShift == 3.),
        resLatchLd      WIDTH 1 DEFAULT 1.,
        aluCarryIn      WIDTH 1 DEFAULT 0.,
        spare5          WIDTH 1 DEFAULT 0.;

MICROPROGRAM

(* NOP *)
noPhase6Write;
noPhase6Write;
noPhase6Write;
noPhase6Write;

(* ADD *)
aluAdd, resultAlu;
aluAdd, resultAlu;
aluAdd, resultAlu;
aluAdd, resultAlu;

(* ADDC *)
aluAdd, resultAlu;
aluAdd, resultAlu;
aluAdd, resultAlu;
aluAdd, resultAlu;

(* SUB *)
aluSub, resultAlu;
aluSubI, resultAlu;
aluSub, resultAlu;
aluSubI, resultAlu;

(* SUBC *)
aluSub, resultAlu;
aluSubI, resultAlu;
aluSub, resultAlu;
aluSubI, resultAlu;

(* AND *)
aluAnd, resultAlu;
aluAnd, resultAlu;
aluAnd, resultAlu;
aluAnd, resultAlu;

(* OR *)
aluOr, resultAlu;
aluOr, resultAlu;
aluOr, resultAlu;
aluOr, resultAlu;

(* XOR *)
aluXor, resultAlu;
aluXor, resultAlu;
aluXor, resultAlu;
aluXor, resultAlu;

(* SRL *)
shIRf0, shARf1, shRight == true, shSigned == false, resultShift;
shIRf0, shARf1, shRight == true, shSigned == false, resultShift;
shIRf0, shARf1, shRight == true, shSigned == false, resultShift;
shIRf0, shARf1, shRight == true, shSigned == false, resultShift;

(* SRA *)
shIRf0, shARf1, shRight == true, shSigned == true, resultShift;
shIRf0, shARf1, shRight == true, shSigned == true, resultShift;
shIRf0, shARf1, shRight == true, shSigned == true, resultShift;
shIRf0, shARf1, shRight == true, shSigned == true, resultShift;

(* SLX *)
shIRf0, shARf1, shRight == false, shSigned == false, resultShift;
shIRf0, shARf1, shRight == false, shSigned == false, resultShift;
shIRf0, shARf1, shRight == false, shSigned == false, resultShift;
shIRf0, shARf1, shRight == false, shSigned == false, resultShift;
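
The way a microword is assembled from named field values can be sketched in Python. This is only an illustration, not the DAPL assembler: it covers the last eight fields of the Phase 3 word (aluFn through spare5, 17 bits in total), assumes that NUMBER_BITS HIGHTOLOW means the first declared field occupies the most significant bits, and assumes aluFn and resSel default to 0. The `assemble` helper and the tables are hypothetical names.

```python
# (field name, width in bits, default value), in declaration order.
TAIL_FIELDS = [
    ("aluFn", 4, 0),       # default assumed to be 0
    ("sysCwphFn", 3, 7),
    ("usrCwphFn", 3, 7),
    ("cwplFn", 2, 2),
    ("resSel", 2, 0),      # default assumed to be 0
    ("resLatchLd", 1, 1),
    ("aluCarryIn", 1, 0),
    ("spare5", 1, 0),
]

# A few of the symbolic values declared with WITH (...).
SYMBOLS = {
    "aluAdd": ("aluFn", 4),
    "aluSub": ("aluFn", 5),
    "aluAnd": ("aluFn", 14),
    "resultAlu": ("resSel", 0),
    "resultShift": ("resSel", 3),
}


def assemble(symbols):
    """Pack defaults plus the named symbols into one integer, high field first."""
    values = {name: default for name, _, default in TAIL_FIELDS}
    for symbol in symbols:
        field, value = SYMBOLS[symbol]
        values[field] = value
    word = 0
    for name, width, _ in TAIL_FIELDS:
        if values[name] >= (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | values[name]
    return word
```

Under these assumptions the microinstruction `aluAdd, resultAlu;` packs to `0x9FC4` in the low 17 bits, since every field not named keeps its default.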


** RISC/E II Phase 4 Decode RAM microcode.
**
** 4/17/83
**

DESCRIPTION

MICROMEMORY_IS 128 WORDS_BY 30 BITS;
INTERLIST HEX;
NUMBER_BITS HIGHTOLOW;

DEFINE true AS 1.;
DEFINE false AS 0.;

FIELDS  spare1           WIDTH 3 DEFAULT 0.,
        camSel           WIDTH 1 DEFAULT 0.
                         WITH (camPC == 0.,
                               camRbus == 1.),
        pswISel          WIDTH 1 DEFAULT 0.
                         WITH (pswSys == 0.,
                               pswIn == 1.),
        cwplSel          WIDTH 2 DEFAULT 0.
                         WITH (cwplIn == 0.,
                               cwplSys == 2.,
                               cwplUsr == 3.),
        swpSel           WIDTH 1 DEFAULT 0.
                         WITH (swpSys == 0.,
                               swpUsr == 1.),
        calLE            WIDTH 1 DEFAULT 0.,
        loadPC           WIDTH 1 DEFAULT 0.,
        pccbToCbusOE     WIDTH 1 DEFAULT 0.,
        regAsMemEnable   WIDTH 1 DEFAULT 0.,
        weCache          WIDTH 1 DEFAULT 0.,
        cdlLE            WIDTH 1 DEFAULT 0.,
        csRegfile        WIDTH 1 DEFAULT 1.,
        shOLE            WIDTH 1 DEFAULT 0.,
        aluOutLatch      WIDTH 1 DEFAULT 1.,
        cwplOE           WIDTH 1 DEFAULT 0.,
        swpOE            WIDTH 1 DEFAULT 0.,
        spare2           WIDTH 1 DEFAULT 0.,
        resultOE         WIDTH 1 DEFAULT 1.,
        pswOE            WIDTH 1 DEFAULT 0.,
        pswSysFlagsLoad  WIDTH 1 DEFAULT 0.,
        pswAluFlagsLoad  WIDTH 1 DEFAULT 0.,
        sysCwplIE        WIDTH 1 DEFAULT 0.,
        usrCwplIE        WIDTH 1 DEFAULT 0.,
        sysCwphOE        WIDTH 1 DEFAULT 0.,
        usrCwphOE        WIDTH 1 DEFAULT 0.,
        swpSysIE         WIDTH 1 DEFAULT 0.,
        swpUsrIE         WIDTH 1 DEFAULT 0.;


MICROPROGRAM

(* NOP *)

(* ADD *)
regAsMemEnable == true, aluOutLatch == true;
regAsMemEnable == true, aluOutLatch == true;

(* ADDC *)
regAsMemEnable == true, aluOutLatch == true;
regAsMemEnable == true, aluOutLatch == true;

(* SUB *)
regAsMemEnable == true, aluOutLatch == true;
regAsMemEnable == true, aluOutLatch == true;

(* SUBC *)
regAsMemEnable == true, aluOutLatch == true;
regAsMemEnable == true, aluOutLatch == true;

(* AND *)
regAsMemEnable == true, aluOutLatch == true;
regAsMemEnable == true, aluOutLatch == true;

(* OR *)
regAsMemEnable == true, aluOutLatch == true;
regAsMemEnable == true, aluOutLatch == true;

(* XOR *)
regAsMemEnable == true, aluOutLatch == true;
regAsMemEnable == true, aluOutLatch == true;

(* SRL *)
regAsMemEnable == true, shOLE == true;
regAsMemEnable == true, shOLE == true;

(* SRA *)
regAsMemEnable == true, shOLE == true;
regAsMemEnable == true, shOLE == true;

(* SLX *)
regAsMemEnable == true, shOLE == true;
regAsMemEnable == true, shOLE == true;

(* LDHI *)
regAsMemEnable == false, shOLE == true;
regAsMemEnable == false, shOLE == true;

(* CALLX *)
regAsMemEnable == true, aluOutLatch == true, shOLE == true;
regAsMemEnable == true, aluOutLatch == true, shOLE == true;


